Fast Convergence Rates for Distributed
Non-Bayesian Learning 

Angelia Nedic, Alex Olshevsky and Cesar A. Uribe* 


Abstract —We consider the problem of distributed learning, 
where a network of agents collectively aims to agree on a
hypothesis that best explains a set of distributed observations
of conditionally independent random processes. We propose a
distributed algorithm and establish consistency, as well as a 
non-asymptotic, explicit and geometric convergence rate for the 
concentration of the beliefs around the set of optimal hypotheses. 
Additionally, if the agents interact over static networks, we 
provide an improved learning protocol with better scalability 
with respect to the number of nodes in the network. 

Index Terms —Distributed algorithms, Algorithm design and
analysis, Bayes methods, Learning, Estimation.

I. Introduction 

Large numbers of interconnected components add to the
complexity of engineering systems. Developing models and
tools for the analysis of such distributed systems is necessary,
not only from the engineering point of view but for effective
decision-making and policy design. For example, the control of
autonomous vehicles for exploration, rescue, and surveillance
depends on the coordination abilities of fleets of robots; each
robot should make decisions based on local information and
limited communications. Power networks (e.g. the electric
grid) need several generating and consuming stations to
coordinate offer and demand to improve efficiency. In traffic
control, the goal is to distributively avoid jams and to
increase traffic flow based on limited infrastructure (e.g. roads).
Economic systems need modeling, estimation and control of
markets at the micro and macroeconomic scales. Market
dynamics depend on several agents influencing the system, each
of which might have conflicting goals. In telecommunication
networks, several stations need to communicate over
non-perfect channels to optimize information transmission. The
control of industrial processes requires communication and
coordination between different parts of the process in
hazardous environments. The modeling and control of ecological
systems requires the analysis of several actors interacting with
each other, subject to changing environments.

Traditional approaches for the design of distributed
inference algorithms, for inherently distributed systems, assume
that a fusion center exists. The fusion center gathers all the
information and makes centralized decisions [1], [2], [3], [4].
Nonetheless, communication constraints, limited memory and
lack of physical accessibility to certain measurements hinder
this task. Therefore, it is necessary to develop algorithmic
protocols that take into account such constraints and use only
locally available information. Although many results on these
themes have appeared in recent years, the study of distributed
decision-making and computation traces back to classic papers
from the 70s and 80s [5], [6], [7], [8], [9], [10], [11].

In [12], the authors describe results on learning in social
networks based on computing posterior distributions using
Bayes' rule. That is, given some assumed prior knowledge
and new observations, an agent computes a posterior based
on likelihood models, see [13]. Nevertheless, a fully Bayesian
approach might not be possible because full knowledge of the
network structure, or of other agents' likelihood models, need
not be available [14], [15]. Other authors showed that
non-Bayesian methods can be used in learning tasks as well [16],
[17], [18], [19]. In this case, agents are assumed to be
boundedly rational (i.e. they fail to aggregate information in a fully
Bayesian manner [20]). They repeatedly communicate with
others and use naive approaches to aggregate information.

Several groundbreaking papers have described distributed
methods to achieve global behaviors by repeatedly aggregating
local information without complete knowledge of the
network El, ED, [22], El. For example, in distributed
hypothesis testing using belief propagation, convergence and its
dependence on the communication structure were shown [23].
Later, extensions to finite capacity channels, packet losses,
delayed communications and tracking were developed El,
El. In ED, the authors proved convergence in probability,
the asymptotic normality of the distributed estimation and
provided conditions under which the distributed estimation
is as good as a centralized one. Later in El, the almost
sure convergence of a non-Bayesian rule based on the arithmetic
mean was shown for fixed topology graphs. Extensions to
information heterogeneity and asymptotic convergence rates
have been derived as well ITSl. Following ITTl, other methods
to aggregate Bayes estimates in a network have been explored.
In [26], geometric means are used for fixed topologies as
well; however, the consensus and learning steps are separated.
The work in [27] extends the results of El to time-varying
undirected graphs. In [19], local exponential rates of
convergence for undirected gossip-like graphs are studied. The
authors in ESI, ED, ED, E3 proposed a non-Bayesian
learning algorithm where a local Bayes' update is followed
by a consensus step. In [28], a convergence result for fixed
graphs is provided and large deviation convergence rates are
given, proving the existence of a random time after which
the beliefs will concentrate exponentially fast. In [29], similar
probabilistic bounds for the rate of convergence are derived for
fixed graphs and comparisons with the centralized version of
the learning rule are provided. Other variations of the
non-Bayesian approach have been proposed for a continuum set
of hypotheses ED, weakly connected graphs, bisection
search algorithms ES, transmission node failures El, ES,
and time-varying graphs IH, E3, E3. See [40], iHTl
for an extended literature review.

In this paper, we consider a network of agents, where each 
agent repeatedly receives information from its neighbors as 
well as private signals from an external source. The private 
signals are realizations of a random variable with an unknown 
distribution. The agents would like to collectively agree on a 
hypothesis (distribution) that best explains the data observed 
by all nodes/agents. We focus on the case where agents might 
have inconsistent hypotheses, in the sense that, the hypotheses 
that best describe private observations need not be the same 
as the hypotheses that best describe the aggregated set of 
observations of all agents. 

The contributions of this paper are as follows. First, we propose and
motivate a novel distributed non-Bayesian learning rule. We
derive the proposed algorithm as the solution of a natural
extension of the variational representation of Bayes' updates
in a distributed setting. This characterizes a general family of
distributed non-Bayesian learning protocols. We show that
existing protocols are instances of this general family of algorithms.
Additionally, we show that the proposed protocol allows the
network to learn the set of hypotheses that best explain the data
collected by all the nodes (i.e. consistency). We also provide
a geometric, non-asymptotic, and explicit characterization of
the convergence rate, which immediately leads to finite-time
bounds that scale intelligibly with the number of nodes for
general time-varying undirected graphs. Finally, we propose
and analyze a new protocol for arbitrary fixed undirected
graphs that scales better than previous algorithms with respect
to the number of agents in the network.

Simultaneous and independent works obtained results which
overlap with ours [29], [28]. Specifically, in [29] the authors
proposed a variant of a distributed learning algorithm, and a similar
convergence rate was obtained. Consistency and asymptotic
rates were provided in [28] for another class of non-Bayesian
learning. Moreover, specific instances of the problem studied
in this work have been considered in the context of distributed
parameter estimation El, ED. We note that, relative to
these simultaneous papers, our results are more general in
the sense that they allow time-varying networks and allow
nodes to have conflicting hypotheses, none of which matches
the distribution of the observations. Furthermore, in the case
of fixed undirected graphs, we propose an update rule which
involves an additional register of memory in each node to
obtain a more graceful scaling with the number of nodes in
the network. Section III provides a more detailed comparison
with the mentioned papers.

This paper is organized as follows. In Section II we
describe the problem and main results. In Section III we
introduce a general class of distributed non-Bayesian
learning rules and provide comparisons with recent literature. In
Section IV we analyze the consistency of the information
aggregation and estimation models, while in Section V we
prove a non-asymptotic convergence rate for the concentration
of the beliefs generated by the proposed algorithm for
time-varying graphs. In Section VI we show the convergence time
improvement for a new protocol for fixed undirected graphs.
Section VII develops the application of the proposed methods
for the problem of distributed source localization. Conclusions
and future work directions are discussed in Section VIII.

Notation: We use upper case letters to represent random
variables (e.g. $X_k$), and the corresponding lower case letters
for their realizations (e.g. $x_k$). We write $[A]_{ij}$ or $A_{ij}$ to
denote the entry of the matrix $A$ in the $i$-th row and $j$-th
column. We write $A'$ for the transpose of a matrix $A$ and $x'$
for the transpose of a vector $x$. We use $I_n$ for the identity
matrix of size $n$ by $n$. Bold letters represent vectors, which
are assumed to be column vectors unless specified otherwise.
The $i$-th entry of a vector will be denoted by a superscript
$i$, i.e., $\mathbf{x}_k = [x_k^1, \ldots, x_k^n]'$. We write $\mathbf{1}_n$ to denote the
all-ones vector of size $n$. For a sequence of matrices $\{A_k\}$, we
let $A_{k_f:k_i} = A_{k_f} \cdots A_{k_i+1} A_{k_i}$ for all $k_f \geq k_i \geq 0$. We
abbreviate almost surely by a.s. and independent
identically distributed by i.i.d. In general, when referring to an
agent $i$ we will use superscripts and when referring to a time
instant $k$ we will use subscripts.

II. Problem Setup and Main Results 

Consider a group of $n$ agents, indexed by $1, 2, \ldots, n$,
each having observations of conditionally independent random
processes, at discrete time steps $k = 1, 2, 3, \ldots$. Specifically,
agent $i$ observes the random variables $S_1^i, S_2^i, \ldots$, which are
i.i.d. and distributed according to an unknown probability
distribution $f^i$. The output space of the random variables $S_k^i$ is
a finite set which we will denote by $\mathcal{S}^i$. For convenience, we
stack up all the $S_k^i$ into a vector denoted as $S_k$. Then, $S_k$ is
an i.i.d. vector taking values in $\mathcal{S} = \prod_{i=1}^n \mathcal{S}^i$ distributed
as $f = \prod_{i=1}^n f^i$. Furthermore, each agent $i$ has a family of
probability distributions $\{\ell^i(\cdot|\theta)\}$ parametrized by a finite set
$\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$ with $m$ elements. One can think of $\Theta$ as
a set of hypotheses and $\ell^i(\cdot|\theta)$ as the probability distribution
that would be seen by agent $i$ if hypothesis $\theta$ were true. We do
not require that there exists $\theta \in \Theta$ with $\ell^i(\cdot|\theta) = f^i$ almost
everywhere for all $i = 1, \ldots, n$; in other words, there may
not be a hypothesis that matches the observations made by
the nodes. Rather, the objective of all agents is to agree on a
subset of $\Theta$ that best fits all the observations in the network.

Formally, this setup describes the scenario where the group 
of agents collectively tries to solve the following optimization 
problem 

\[
\min_{\theta \in \Theta} F(\theta) \triangleq D_{KL}\big(f \,\|\, \ell(\cdot|\theta)\big)
= \sum_{i=1}^{n} D_{KL}\big(f^i \,\|\, \ell^i(\cdot|\theta)\big) \tag{1}
\]

where $D_{KL}(f^i \| \ell^i(\cdot|\theta))$ is the Kullback-Leibler (KL) divergence
between the distribution of $S_k^i$ and $\ell^i(\cdot|\theta)$. The distributions
$f^i$ are unknown, therefore the agents try to "learn" the
solution to this optimization problem based on local observations
and interactions, see Figure 1.



Fig. 1. Geometric interpretation of the learning objective. The triangle
represents the simplex of all possible probability distributions of $S_k$. The
point $f$ is the actual distribution of $S_k$. The goal of the network of agents is to
learn the hypothesis $\theta^*$ that best describes its observations, which corresponds
to the distribution $\ell(\cdot|\theta^*)$ (the closest to the distribution $f$).


Consider for example a group of two agents, labeled by
1 and 2, such that $S_k^i \sim \mathcal{N}(1, 1)$, which is equivalent to
$S_k^i = 1 + W_k^i$ where $W_k^i \sim \mathcal{N}(0, 1)$ is a zero mean Gaussian
process with unitary standard deviation. They want to correctly
identify the parameter $\theta^*$ out of three possible hypotheses
$\Theta = \{\theta_1, \theta_2, \theta_3\}$ where the likelihood models of the agents
are: $\ell^1(s^1|\theta_1) = \phi(s^1 - 0.5)$, $\ell^2(s^2|\theta_1) = \phi(s^2)$,
$\ell^1(s^1|\theta_2) = \phi(s^1 - 1.5)$, $\ell^2(s^2|\theta_2) = \phi(s^2 - 0.5)$,
$\ell^1(s^1|\theta_3) = \phi(s^1 - 2.5)$, $\ell^2(s^2|\theta_3) = \phi(s^2 - 1.5)$,
where $\phi(x) = \exp(-\frac{1}{2}x^2)/\sqrt{2\pi}$
is the probability density function of the standard normal
distribution. In this scenario, agent 1 alone would not be
able to differentiate between $\theta_1$ and $\theta_2$, and agent 2 cannot
differentiate between $\theta_2$ and $\theta_3$, given that they are at the
same distance to the true distribution of the observations.
Nonetheless, when they interact with each other the solution
to the proposed optimization problem is $\theta^* = \theta_2$.
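The two-agent example can be checked numerically. The sketch below is illustrative only: it uses the closed form $D_{KL}(\mathcal{N}(a,1) \| \mathcal{N}(b,1)) = (a-b)^2/2$, and the dictionary of model means (in particular the mean 0.5 for agent 2 under $\theta_2$) is an assumption of this sketch rather than a quotation from the text.

```python
# Illustrative check of the two-agent Gaussian example (all names are hypothetical).
TRUE_MEAN = 1.0  # both agents observe samples from N(1, 1)

# Means of the unit-variance Gaussian likelihood models per agent and hypothesis.
# The value 0.5 for agent 2 under theta_2 is an assumption of this sketch.
MODEL_MEANS = {
    1: {"theta_1": 0.5, "theta_2": 1.5, "theta_3": 2.5},
    2: {"theta_1": 0.0, "theta_2": 0.5, "theta_3": 1.5},
}


def kl_unit_gaussians(mean_p, mean_q):
    """KL divergence between N(mean_p, 1) and N(mean_q, 1)."""
    return 0.5 * (mean_p - mean_q) ** 2


# Local divergences: what each agent can measure on its own.
local_kl = {
    agent: {theta: kl_unit_gaussians(TRUE_MEAN, mean) for theta, mean in models.items()}
    for agent, models in MODEL_MEANS.items()
}

# Network objective of Eq. (1): the sum of the local divergences.
objective = {
    theta: sum(local_kl[agent][theta] for agent in MODEL_MEANS)
    for theta in ("theta_1", "theta_2", "theta_3")
}
best_hypothesis = min(objective, key=objective.get)
```

Under these assumed means, each agent has a tie between two hypotheses locally, while the summed objective has a single minimizer.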


A. Proposed Learning Algorithms 

Probability distributions over the hypothesis set $\Theta$ will be
referred to as beliefs. Every agent $i$ has an initial belief $\mu_0^i$, which
we often refer to as its prior distribution or prior belief. We
will be studying the dynamics wherein agents exchange beliefs
with their neighbors over some communication network, with
the effect that over time these beliefs concentrate on the "best"
choice of hypotheses. Each agent $i$ generates a new belief for
time $k + 1$, which we will denote by $\mu_{k+1}^i$, based on its current
belief $\mu_k^i$, an observation of the random variable $S_{k+1}^i$,
and the current beliefs of its neighbors $\mu_k^j$ with $j \neq i$. We
propose two algorithms for the generation of the new belief
$\mu_{k+1}^i$: a generic rule for undirected time-varying graphs and
a special rule for static graphs. We show that the proposed
update rules generate a sequence of beliefs that sequentially
approaches a solution to the optimization problem in (1).


We consider the following rule for general undirected
time-varying graphs: for each $\theta \in \Theta$,

\[
\mu_{k+1}^i(\theta) = \frac{1}{Z_{k+1}^i} \prod_{j=1}^{n} \mu_k^j(\theta)^{[A_k]_{ij}} \, \ell^i(s_{k+1}^i|\theta)^{\beta_k^i} \tag{2}
\]

where $Z_{k+1}^i$ is a normalization factor to make the beliefs a
probability distribution, i.e.,

\[
Z_{k+1}^i = \sum_{p=1}^{m} \prod_{j=1}^{n} \mu_k^j(\theta_p)^{[A_k]_{ij}} \, \ell^i(s_{k+1}^i|\theta_p)^{\beta_k^i},
\]

where $A_k$ is a non-negative matrix of "weights", which
is compliant with the connectivity structure of the underlying
communication network. The network at each time instant $k$
is modeled as a graph $\mathcal{G}_k$ composed of a node set $V =
\{1, 2, \ldots, n\}$ and a set $E_k$ of undirected links. The variable
$\beta_k^i$ is a stationary Bernoulli random process with mean $q^i$,
which indicates if an agent obtained a new realization of $S_{k+1}^i$.
Specifically, $\beta_k^i = 1$ indicates that agent $i$ obtained a new
observation, while $\beta_k^i = 0$ indicates that it did not.
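One step of a rule of this form (weighted geometric averaging of neighbors' beliefs, followed by multiplication with the local likelihood when an observation is available) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the argument layout, and the log-domain normalization are assumptions of the sketch, and beliefs are assumed to be strictly positive.

```python
import math


def belief_update(beliefs, weights, likelihoods, observed):
    """One step of a log-linear belief update.

    beliefs[j][p]     : belief of agent j on hypothesis p (assumed strictly positive)
    weights[i][j]     : weight that agent i gives to agent j (rows sum to one)
    likelihoods[i][p] : likelihood of agent i's new observation under hypothesis p
    observed[i]       : 1 if agent i received a new observation, 0 otherwise
    """
    n, m = len(beliefs), len(beliefs[0])
    updated = []
    for i in range(n):
        log_unnormalized = []
        for p in range(m):
            # Weighted geometric average of the neighbors' beliefs, in the log domain.
            value = sum(
                weights[i][j] * math.log(beliefs[j][p])
                for j in range(n)
                if weights[i][j] > 0
            )
            # The likelihood only enters when an observation is available.
            value += observed[i] * math.log(likelihoods[i][p])
            log_unnormalized.append(value)
        shift = max(log_unnormalized)  # subtract the max for numerical stability
        unnormalized = [math.exp(v - shift) for v in log_unnormalized]
        total = sum(unnormalized)
        updated.append([u / total for u in unnormalized])
    return updated


# Two agents, two hypotheses; only agent 0 receives an observation.
beliefs = [[0.5, 0.5], [0.5, 0.5]]
weights = [[0.5, 0.5], [0.5, 0.5]]
likelihoods = [[0.8, 0.2], [0.5, 0.5]]
updated = belief_update(beliefs, weights, likelihoods, observed=[1, 0])
```

Working in the log domain avoids underflow when beliefs on poor hypotheses become very small, which is the regime the convergence results describe.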

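For static undirected graphs, we propose a new belief update
rule with one-step memory as follows: for each $\theta \in \Theta$,

\[
\mu_{k+1}^i(\theta) = \frac{1}{\tilde{Z}_{k+1}^i} \,
\frac{\prod_{j=1}^{n} \mu_k^j(\theta)^{(1+\sigma) A_{ij}} \, \ell^i(s_{k+1}^i|\theta)^{\beta_k^i}}
{\prod_{j=1}^{n} \big( \mu_{k-1}^j(\theta) \, \ell^j(s_k^j|\theta)^{\beta_{k-1}^j} \big)^{\sigma A_{ij}}} \tag{3}
\]

where $\tilde{Z}_{k+1}^i$ is the corresponding normalization factor given
by

\[
\tilde{Z}_{k+1}^i = \sum_{p=1}^{m}
\frac{\prod_{j=1}^{n} \mu_k^j(\theta_p)^{(1+\sigma) A_{ij}} \, \ell^i(s_{k+1}^i|\theta_p)^{\beta_k^i}}
{\prod_{j=1}^{n} \big( \mu_{k-1}^j(\theta_p) \, \ell^j(s_k^j|\theta_p)^{\beta_{k-1}^j} \big)^{\sigma A_{ij}}},
\]

where $A$ is a specifically chosen matrix (called the lazy
Metropolis matrix) and $\sigma$ is a constant to be set later. We
initialize $\mu_{-1}^i(\theta)$ to be equal to $\mu_0^i(\theta)$ for all $i = 1, \ldots, n$
and $\theta \in \Theta$. We will show that this update rule generates a
sequence of beliefs that concentrate at a rate a factor of $n$ faster
than the previous results. Note that the update rule described
in Eq. (3) requires the communication of the product of the
beliefs and likelihood functions, and an additional memory register,
since the beliefs at time $k + 1$ depend on the beliefs at time
$k$ and at time $k - 1$.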

Section III will motivate the choice of the update rules. They
can be interpreted as natural generalizations of the variational
representation of the Bayes update rule for the distributed
learning setting.


B. Assumptions and Definitions 

We will list a sequence of assumptions about the underlying
communication graph and the family of parametrized
likelihood models. They will guarantee the desired convergence
properties.

For the first class of update rules described in Eq. (2), we
assume the following structure for the sequence of
communication graphs $\{\mathcal{G}_k\}$.

Assumption 1 The graph sequence $\{\mathcal{G}_k\}$ and the matrix
sequence $\{A_k\}$ are such that:

(a) $A_k$ is doubly-stochastic with $[A_k]_{ij} > 0$ if $(i,j) \in E_k$.

(b) If $(i,j) \notin E_k$ for some $i \neq j$ then $[A_k]_{ij} = 0$.

(c) $A_k$ has positive diagonal entries, $[A_k]_{ii} > 0$ for all $i = 1, \ldots, n$.

(d) If $[A_k]_{ij} > 0$, then $[A_k]_{ij} \geq \eta$ for some positive constant $\eta$.

(e) $\{\mathcal{G}_k\}$ is B-strongly connected, i.e., there is an integer
$B \geq 1$ such that the graph $\big\{V, \bigcup_{i=kB}^{(k+1)B-1} E_i\big\}$ is strongly
connected for all $k \geq 0$.

Assumption 1(a) and Assumption 1(b) characterize the
communication between agents. If two agents can exchange
information at a certain time instant $k$, the underlying
communication graph will have an edge between the corresponding
nodes. This also implies a positive weighting of the
information shared. The graph sequence $\{\mathcal{G}_k\}$ and the matrix sequence
$\{A_k\}$ define a corresponding inhomogeneous Markov chain
with transition probabilities $A_k$. Assumption 1(c) guarantees
the aperiodicity of this Markov chain. Additionally,
Assumptions 1(d) and 1(e) guarantee that this Markov chain is ergodic
by ensuring there is sufficient connectivity and that the entries
of $A_k$ do not vanish. Assumption 1 is common in the distributed
optimization and consensus literature [42], [43]. It guarantees
convergence of the associated Markov chain and defines
bounds on relevant eigenvalues in terms of the number of
agents.

There are several ways to construct a set of weights
satisfying Assumption 1. For example, one can consider a lazy
Metropolis (stochastic) matrix of the form $A_k = \frac{1}{2} I_n + \frac{1}{2} \bar{A}_k$,
where $I_n$ is the identity matrix and $\bar{A}_k$ is a stochastic matrix
whose off-diagonal entries satisfy

\[
[\bar{A}_k]_{ij} =
\begin{cases}
\dfrac{1}{\max\{d_k^i + 1, \, d_k^j + 1\}} & \text{if } (i,j) \in E_k, \\[2mm]
0 & \text{if } (i,j) \notin E_k,
\end{cases}
\]

where $d_k^i$ is the degree (the number of neighbors) of node
$i$ at time $k$. Note that the lazy Metropolis weights require
undirected communications since each weight $[A_k]_{ij}$ depends
on the degree of both agent $i$ and agent $j$. Thus, we will
require that agents share their beliefs as well as their degree,
which means exchanging $m + 1$ numbers at each time step.
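The construction of lazy Metropolis weights can be sketched for a small undirected graph. The helper name `lazy_metropolis` and the edge-list input are assumptions of this sketch; self-loops in the edge list are not supported.

```python
def lazy_metropolis(n, edges):
    """Lazy Metropolis weights for an undirected graph on nodes 0..n-1.

    edges is a list of pairs (i, j) with i != j; each pair is treated as undirected.
    """
    neighbors = {i: set() for i in range(n)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    degree = {i: len(neighbors[i]) for i in range(n)}

    # Metropolis matrix: off-diagonal entries 1 / max(d_i + 1, d_j + 1) on edges,
    # diagonal entries chosen so that every row sums to one.
    metropolis = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            metropolis[i][j] = 1.0 / max(degree[i] + 1, degree[j] + 1)
        metropolis[i][i] = 1.0 - sum(metropolis[i])

    # Lazy version: average with the identity matrix.
    return [
        [0.5 * (1.0 if i == j else 0.0) + 0.5 * metropolis[i][j] for j in range(n)]
        for i in range(n)
    ]


# Path graph 0 - 1 - 2.
A = lazy_metropolis(3, [(0, 1), (1, 2)])
```

Because the off-diagonal entries are symmetric and each row sums to one, the resulting matrix is doubly stochastic with diagonal entries of at least 1/2, in line with the properties required by Assumption 1 for a connected graph.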

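Analogous to Assumption 1, we use the following
assumption when the interaction between the agents happens over
static graphs with the update rule described in Eq. (3).

Assumption 2 The graph sequence $\{\mathcal{G}_k\}$ is static (i.e. $\mathcal{G}_k = \mathcal{G}$
for all $k$) and undirected, and the weight matrix $A$ is a lazy
Metropolis matrix, defined by

\[
A = \frac{1}{2} I_n + \frac{1}{2} \bar{A},
\]

where $\bar{A}$ is the Metropolis matrix, which is the unique
stochastic matrix whose off-diagonal entries satisfy

\[
\bar{A}_{ij} =
\begin{cases}
\dfrac{1}{\max\{d^i + 1, \, d^j + 1\}} & \text{if } (i,j) \in E, \\[2mm]
0 & \text{if } (i,j) \notin E,
\end{cases}
\]

with $d^i$ being the degree of the node $i$ (i.e., the number of
neighbors of $i$ in the graph).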

Next, we provide three important definitions that we use in 
the sequel to describe some learning-related quantities. 

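Definition 1 The group confidence of a nonempty subset
$W \subseteq V$ of agents is given by

\[
C_q^W(\theta) = -\sum_{i \in W} q^i D_{KL}\big(f^i \,\|\, \ell^i(\cdot|\theta)\big), \quad \forall \theta \in \Theta,
\]

where $q^i$ is the mean-value of the i.i.d. Bernoulli variable $\beta_k^i$
characterizing the availability of measurements for agent $i$. If
$W = V$, we simply write $C_q(\theta)$.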

The group confidence provides a way to quantify the quality 
of a hypothesis from the perspective of a subset of the agents. 
The quality of a hypothesis for individual agents is weighted 
by the mean of the i.i.d. Bernoulli process governing the 
availability of observations. 
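For finite observation spaces, the group confidence and the set of hypotheses that maximize it can be computed directly. The sketch below is illustrative: the toy distributions, the availability means in `q`, and all function names are assumptions of the sketch.

```python
import math


def kl_divergence(p, q):
    """KL divergence between two distributions on the same finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def group_confidence(group, q, true_dists, models, theta):
    """Negative weighted sum of local KL divergences over the agents in `group`."""
    return -sum(q[i] * kl_divergence(true_dists[i], models[i][theta]) for i in group)


def optimal_set(q, true_dists, models, hypotheses, tol=1e-12):
    """Hypotheses whose network-wide confidence is maximal (up to `tol`)."""
    agents = range(len(true_dists))
    confidence = {t: group_confidence(agents, q, true_dists, models, t) for t in hypotheses}
    best = max(confidence.values())
    return {t for t, c in confidence.items() if best - c <= tol}


# Toy data: two agents with binary observations and two hypotheses.
true_dists = [[0.5, 0.5], [0.8, 0.2]]
models = [
    {"theta_a": [0.5, 0.5], "theta_b": [0.5, 0.5]},  # agent 0 cannot tell them apart
    {"theta_a": [0.8, 0.2], "theta_b": [0.2, 0.8]},
]
q = [1.0, 0.5]  # assumed mean availability of observations per agent

agent0_a = group_confidence([0], q, true_dists, models, "theta_a")
agent0_b = group_confidence([0], q, true_dists, models, "theta_b")
best_hypotheses = optimal_set(q, true_dists, models, ["theta_a", "theta_b"])
```

In this toy setup the two hypotheses have equal confidence for the group containing only agent 0, while the full network separates them.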

Definition 2 Two distinct hypotheses $\theta_i$ and $\theta_j$ are said to be
$W$-observationally equivalent if $C_q^W(\theta_i) = C_q^W(\theta_j)$.

This definition extends the idea of observational equivalence 
introduced in ini. Group observational equivalence provides a 
general definition where a group of agents can not differentiate 
between two hypotheses even if their corresponding likelihood 
models are not the same. 

Finally, we introduce the optimal set of hypotheses as the 
set with the maximum group confidence. 

Definition 3 The optimal hypothesis set is defined as
$\Theta^* = \arg\max_{\theta \in \Theta} C_q(\theta)$, and the confidence of the optimal
hypothesis set is denoted as $C_q^*$, i.e., $C_q^* = C_q(\theta^*)$ for $\theta^* \in \Theta^*$.

The optimal set is always nonempty, and we assume it is a
strict subset of $\Theta$ to avoid the trivial case where all hypotheses
are observationally equivalent. This holds if there is a unique
true state, $\hat{\theta} \in \Theta$, such that each agent $i$ sees distributions
generated according to $f^i = \ell^i(\cdot|\hat{\theta})$, and $\Theta$ contains other
hypotheses besides $\hat{\theta}$.

Informally, we will refer to our assumptions above as
describing a setup with conflicting models; by this, we mean that
the hypothesis which best describes the observations of agent $i$
(i.e., the hypothesis $\theta$ which minimizes $D_{KL}(f^i \| \ell^i(\cdot|\theta))$) may
not be the hypothesis which best describes the observations of
a different agent, and may in fact not belong to the optimal
set $\Theta^*$.

We will further require the following assumption on the 
agents’ prior distributions and likelihood functions. The first 
of these is sometimes referred to as the Zero Probability 
Property [HI. 




Assumption 3 For all agents $i = 1, \ldots, n$,

(a) The set $\hat{\Theta}^* = \bigcap_{i=1}^{n} \Theta^{*i}$ is nonempty, where $\Theta^{*i} \subseteq \Theta^*$
is the subset of optimal hypotheses with positive initial
beliefs for agent $i$, i.e., $\mu_0^i(\theta) > 0$ for all $\theta \in \Theta^{*i}$ and
$\mu_0^i(\theta) = 0$ for all $\theta \in \Theta^* \setminus \Theta^{*i}$.

(b) The support of the true distribution of the observations
is contained in the support of the likelihood models for
all hypotheses, i.e., there exists an $\alpha > 0$ such that if
$f^i(s^i) > 0$ then $\ell^i(s^i|\theta) > \alpha$ for all $\theta \in \Theta$.

Uniform prior beliefs satisfy Assumption 3(a), which
is a reasonable assumption if there is no initial information
about the quality of the hypotheses. In Eq. (2), if $\mu_k^i(\theta) = 0$ for
some hypothesis $\theta$ and for some agent $i$, at some instant $k$,
then all beliefs of all agents will eventually become zero at
that hypothesis. Assumption 3(a) removes the undesired effects
of this property, which could lead to the inability to learn. In
addition, Assumption 3(b) guarantees the sub-Gaussian
behavior of the observed random variables. Specifically, the derived
convergence rates use results from the measure concentration
of random variables. In the most common setting, the
random variables must have a sub-Gaussian or sub-exponential
behavior in.


C. Results 

We now state our first result; we show that the dynamics in
Eq. (2) concentrate the beliefs on the optimal set $\Theta^*$, which
is precisely the set that best describes the observations. This
theorem will be proven in Section IV.

Theorem 1 Under Assumptions 1 and 3, the update rule of
Eq. (2) has the following property:

\[
\lim_{k \to \infty} \mu_k^i(\theta) = 0 \quad \text{a.s. for all } \theta \notin \Theta^*, \ i = 1, \ldots, n.
\]

Our results regarding the non-asymptotic explicit
convergence rate of the update rules in Eq. (2) and Eq. (3) are given
in Theorem 2 and Theorem 3, while their proofs are provided
in Section V and Section VI respectively.


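Theorem 2 Let Assumptions 1 and 3 hold and let $\rho \in (0, 1)$.
The update rule of Eq. (2) has the following property: there
is an integer $N(\rho)$ such that, with probability $1 - \rho$, for all
$k \geq N(\rho)$ and for all $\theta_v \notin \Theta^*$, we have

\[
\mu_k^i(\theta_v) \leq \exp\left( -\frac{k}{2} \gamma_2 + \gamma_1^i \right) \quad \text{for all } i = 1, \ldots, n,
\]

where

\[
N(\rho) \triangleq \left\lceil \frac{48 (\log \alpha)^2 \log \frac{1}{\rho}}{\gamma_2^2} \right\rceil,
\]

\[
\gamma_1^i = \max_{\theta_w \in \hat{\Theta}^*} \left\{ \max_{\theta_v \notin \Theta^*} \log \frac{\mu_0^i(\theta_v)}{\mu_0^i(\theta_w)} \right\}
+ \frac{12 \log n}{1 - \lambda} \log \frac{1}{\alpha},
\]

\[
\gamma_2 = \frac{1}{n} \min_{\theta_v \notin \Theta^*} \big( C_q^* - C_q(\theta_v) \big),
\]

with $\alpha$ from Assumption 3(b), $\eta$ from Assumption 1(d) and $\lambda$
given by:

\[
\lambda = \left( 1 - \frac{\eta}{4 n^2} \right)^{1/B}.
\]

If each $A_k$ is the lazy Metropolis matrix associated with $\mathcal{G}_k$
and $B = 1$, then

\[
\lambda = 1 - \frac{1}{O(n^2)}.
\]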

In words, the belief of each agent on any hypothesis outside
the optimal set decays at a network-independent rate which
scales with the constant $\gamma_2$, which is the average
Kullback-Leibler divergence to the next best hypothesis. However, there
is a transient due to the $\gamma_1^i$ term (since the bound of Theorem 2
is not below 1 until $k \geq 2\gamma_1^i/\gamma_2$), and the size of this transient
depends on the network and the number of nodes through the
constant $\lambda$.
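The length of this transient can be worked out from the form of the bound, $\exp(-\frac{k}{2}\gamma_2 + \gamma_1)$. The sketch below solves for the first $k$ at which the bound drops below a target $\epsilon$; the numerical values of `gamma_1` and `gamma_2` are hypothetical, and the bound itself is only guaranteed for $k \geq N(\rho)$, with probability $1 - \rho$.

```python
import math


def bound(k, gamma_1, gamma_2):
    """Value of exp(-(k / 2) * gamma_2 + gamma_1) at iteration k."""
    return math.exp(-0.5 * k * gamma_2 + gamma_1)


def iterations_until_below(epsilon, gamma_1, gamma_2):
    """Smallest integer k with bound(k) <= epsilon, from solving the inequality for k."""
    return max(0, math.ceil(2.0 * (gamma_1 + math.log(1.0 / epsilon)) / gamma_2))


# Hypothetical constants, chosen only to illustrate the arithmetic.
gamma_1, gamma_2 = 10.0, 0.5

k_transient = iterations_until_below(1.0, gamma_1, gamma_2)  # bound reaches 1 here
k_small = iterations_until_below(1e-3, gamma_1, gamma_2)     # bound reaches 1e-3 here
```

With these values the transient lasts $2\gamma_1/\gamma_2 = 40$ iterations, and reaching $\epsilon = 10^{-3}$ takes $2\log(1/\epsilon)/\gamma_2$ further iterations, rounded up.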

Observe that the term $\gamma_1^i$ represents the influence of the
initial beliefs as well as the mixing properties of the graph.
If all agents use uniform initial beliefs, i.e., $\mu_0^i = 1/|\Theta|$, then
the effect of the initial beliefs is zero and $\gamma_1^i$ reduces to

\[
\gamma_1^i = \frac{12 \log n}{1 - \lambda} \log \frac{1}{\alpha},
\]

where the constant $\lambda$ may be thought of as the "time to
ergodicity" of the inhomogeneous Markov chain associated
with the matrix sequence $A_k$. On the other hand, if one
can start with an informative prior where $\mu_0^i(\theta^*) > \mu_0^i(\theta)$,
the influence of the initial beliefs will be a negative term,
effectively reducing the transient time.

Our next result shows the belief concentration rate for the
update rule described in Eq. (3).

Theorem 3 Let Assumptions 2 and 3 hold and let $\rho \in (0, 1)$.
Furthermore let $U \geq n$ and let $\sigma = 1 - 2/(9U + 1)$. Then, the
update rule of Eq. (3) with this $\sigma$, uniform initial beliefs with
the condition $\mu_{-1}^i(\theta) = \mu_0^i(\theta)$ and $\beta_{-1}^i$ fixed to zero, has the
following property: there is an integer $N(\rho)$ such that, with
probability $1 - \rho$, for all $k \geq N(\rho)$ and for all $\theta_v \notin \Theta^*$, it
holds that

\[
\mu_k^i(\theta_v) \leq \exp\left( -\frac{k}{2} \gamma_2 + \gamma_1^i \right) \quad \text{for all } i = 1, \ldots, n,
\]

where

\[
N(\rho) \triangleq \left\lceil \frac{48 (\log \alpha)^2 \log \frac{1}{\rho}}{\gamma_2^2} \right\rceil,
\qquad
\gamma_1^i = \frac{4 \log n}{1 - \lambda} \log \frac{1}{\alpha},
\]

with $\alpha$ from Assumption 3(b) and $\lambda = 1 - \frac{1}{18U}$.

Note that the beliefs for $k = -1$ and $k = 0$ are defined to be
equal. Additionally, we assume there is no observation
available for time 0; this holds if we assume $\beta_{-1}^i = 0$ with any
realization of $S_0^i$.

The bound of Theorem 3 is an improvement by a factor
of $n$ compared to the bounds of Theorem 2. In a network
of $n$ agents where $\alpha$, $\rho$ and $\gamma_2$ are treated like constants with
respect to the number of agents, we require at least $O(n \log n)$
iterations for the beliefs on the incorrect hypotheses to be
below a certain small value $\epsilon$ (assuming $U$ is within
a constant factor of $n$). Following the results of [29], the
best bound one could get using Metropolis weights is
$O(n^2 \log n)$, as in Theorem 2 if $B = 1$.

We note, however, that the requirements of Theorem 3 are
more stringent than those of Theorem 2. The network topology
is fixed (i.e. a static graph) and all nodes need to know an
upper bound $U$ on the total number of agents. This upper
bound must be within a constant factor of the number of
agents.


III. Generalized Distributed non-Bayesian 
Learning 


In this section, we discuss a general class of distributed
non-Bayesian algorithms. First, we will motivate the choice of the
update rules described in Eq. (2) and Eq. (3). For simplicity
of exposition, we will assume that the agents always obtain
observations (i.e. $\beta_k^i = 1$ in Eqs. (2) and (3) for all $i$ and $k$).

Then, we will provide a comparison between our algorithms
and previously proposed algorithms within the generalized
distributed non-Bayesian framework.

Standard centralized Bayes' rule can be described as the
solution of a constrained optimization problem ma, m,
113 . The cost function to be minimized is composed of two
terms: one being the Maximum Likelihood Estimation (MLE)
of a state given the observed data and the other being a
regularization function minimized by the current prior [46],
i.e.,

\[
\mu_{k+1}(\theta) = \arg\min_{\pi \in P(\Theta)} \Big\{ D_{KL}(\pi \,\|\, \mu_k) - E_\pi\big[\log \ell(s_{k+1}|\theta)\big] \Big\}
= \frac{\mu_k(\theta) \, \ell(s_{k+1}|\theta)}{\sum_{p=1}^{m} \mu_k(\theta_p) \, \ell(s_{k+1}|\theta_p)},
\]

where $s_{k+1}$ is the most recent observation, $\ell(\cdot|\theta)$ is the
likelihood function for hypothesis $\theta$, $E_\pi$ is the expected value
with respect to the probability distribution $\pi$, and $P(\Theta)$ is the
set of all probability distributions on the set $\Theta$.
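The variational characterization can be checked numerically on a small example. The sketch below evaluates the objective $D_{KL}(\pi \| \mu_k) - E_\pi[\log \ell]$ at the Bayes posterior and at randomly drawn distributions; the prior, the likelihood values, and the sampling scheme are assumptions of the sketch. Algebraically the objective equals $D_{KL}(\pi \| \text{posterior}) - \log Z$, with $Z$ the Bayes normalization constant, so no candidate should fall below the posterior's value.

```python
import math
import random


def bayes_posterior(prior, likelihood):
    """Standard Bayes update on a finite hypothesis set."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def variational_objective(pi, prior, likelihood):
    """KL(pi || prior) minus the expected log-likelihood under pi."""
    kl = sum(x * math.log(x / p) for x, p in zip(pi, prior) if x > 0)
    expected_log_likelihood = sum(x * math.log(l) for x, l in zip(pi, likelihood))
    return kl - expected_log_likelihood


def random_distribution(size, rng):
    """A random point of the probability simplex (not uniformly distributed)."""
    raw = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(raw)
    return [r / total for r in raw]


prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.6, 0.3]  # likelihood of one observation under each hypothesis
posterior = bayes_posterior(prior, likelihood)
best_value = variational_objective(posterior, prior, likelihood)

rng = random.Random(0)
candidates = [random_distribution(3, rng) for _ in range(1000)]
gaps = [variational_objective(c, prior, likelihood) - best_value for c in candidates]
```

Every entry of `gaps` is the KL divergence from a candidate to the posterior, which is why it is non-negative up to rounding.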

We can modify the optimization problem associated with a
Bayesian update to take into account the network structure. We
change the KL divergence term from a single prior belief to a
convex combination of the beliefs of an agent and its neighbors
in the network. The corresponding optimization problem for
agent $i$ is:

\[
\mu_{k+1}^i(\theta) = \arg\min_{\pi \in P(\Theta)} \Big\{ \sum_{j=1}^{n} [A_k]_{ij} D_{KL}(\pi \,\|\, \mu_k^j)
- E_\pi\big[\log \ell^i(s_{k+1}^i|\theta)\big] \Big\}.
\]

Observe that the solution of this optimization problem is
precisely the proposed update rule in Eq. (2).

Opinion pooling or opinion aggregation has been studied
before in 0, m, 0, HO). It is considered a traditional
problem in economics, where several experts have beliefs
about a hypothesis and one needs to aggregate their beliefs into
a single probability distribution. Different opinion aggregation
functions result from using different divergence metrics for
probability distributions (see [48]). Similarly, different opinion
pool operators define different non-Bayesian distributed
learning rules. A general form of opinion pooling was introduced
in im, termed g-Quasi-Linear Opinion pools (g-QLOP),
defined as follows:

\[
F_g(\mu^1, \ldots, \mu^n)(\theta) =
\frac{g^{-1}\Big( \sum_{j=1}^{n} [A_k]_{ij} \, g\big(\mu^j(\theta)\big) \Big)}
{\sum_{p=1}^{m} g^{-1}\Big( \sum_{j=1}^{n} [A_k]_{ij} \, g\big(\mu^j(\theta_p)\big) \Big)},
\]

with $F_g : \prod_{i=1}^{n} P(\Theta) \to P(\Theta)$. The g-QLOP corresponds to
weighted arithmetic averages when $g(x) = x$ and to weighted
geometric averages when $g(x) = \log x$.
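The two special cases of the quasi-linear pool can be compared on a small example. The function below is a sketch of the pooling formula for a single agent's row of weights; the function name, the argument layout, and the numerical beliefs are assumptions of the sketch.

```python
import math


def quasi_linear_pool(beliefs, weights, g, g_inverse):
    """Pool several beliefs: apply g, average with `weights`, apply g_inverse, normalize.

    beliefs[j][p] : belief of source j on hypothesis p (assumed strictly positive)
    weights[j]    : non-negative weight of source j (weights sum to one)
    """
    m = len(beliefs[0])
    unnormalized = [
        g_inverse(sum(w * g(b[p]) for w, b in zip(weights, beliefs)))
        for p in range(m)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


beliefs = [[0.9, 0.1], [0.5, 0.5]]
weights = [0.5, 0.5]

# g(x) = x gives the weighted arithmetic average (linear pool).
linear_pool = quasi_linear_pool(beliefs, weights, lambda x: x, lambda y: y)

# g(x) = log x gives the normalized weighted geometric average (logarithmic pool).
log_pool = quasi_linear_pool(beliefs, weights, math.log, math.exp)
```

Here the linear pool returns (0.7, 0.3), while the logarithmic pool returns (0.75, 0.25), since $\sqrt{0.45}/(\sqrt{0.45}+\sqrt{0.05}) = 3/4$.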

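The update rules studied in this paper can be seen as a
two-step procedure. First, the beliefs of the neighbors are combined
according to an opinion aggregation function. Second, the
resulting aggregate distribution is updated using Bayes' rule.
The proposed update rule, see Eq. (2), uses the Logarithmic
Opinion Pool, where

\[
F_{\log x}(\mu^1, \ldots, \mu^n)(\theta) =
\frac{\prod_{j=1}^{n} \mu^j(\theta)^{[A_k]_{ij}}}
{\sum_{p=1}^{m} \prod_{j=1}^{n} \mu^j(\theta_p)^{[A_k]_{ij}}},
\]

thus

\[
\mu_{k+1}^i(\theta) =
\frac{F_{\log x}\big(\ldots, \mu_k^j(\theta), \ldots\big) \, \ell^i(s_{k+1}^i|\theta)}
{\sum_{p=1}^{m} F_{\log x}\big(\ldots, \mu_k^j(\theta_p), \ldots\big) \, \ell^i(s_{k+1}^i|\theta_p)}.
\]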

Logarithmic Pools are externally Bayesian a, ia, i.e.
the order in which beliefs are aggregated and new evidence
is incorporated does not influence the update rule. That is, from a learning
point of view, if the function is externally Bayesian, we can
interchange the innovation and diffusion steps. The next
proposition shows that the update rule in Eq. (2) is externally
Bayesian.


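Proposition 4 Assume that $\beta_k^i = 1$ for all $i$ and $k$ in the
update rule Eq. (2). Then, this rule is externally Bayesian, i.e.
Eq. (2) is equivalent to:

\[
\mu_{k+1}^i(\theta) = F_{\log x}\left( \ldots,
\frac{\mu_k^j(\theta) \, \ell^i(s_{k+1}^i|\theta)}{\sum_{p=1}^{m} \mu_k^j(\theta_p) \, \ell^i(s_{k+1}^i|\theta_p)},
\ldots \right).
\]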

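Proof: First generate a posterior taking as prior each of
the opinions in the neighbor set:

\[
\mu_{j,k+1}^i(\theta) = \frac{\mu_k^j(\theta) \, \ell^i(s_{k+1}^i|\theta)}{\sum_{p=1}^{m} \mu_k^j(\theta_p) \, \ell^i(s_{k+1}^i|\theta_p)}.
\]

Then combine the resulting $\mu_{j,k+1}^i$ into a new posterior,
denoted by $\mu_{k+1}^i$, as follows:

\[
\mu_{k+1}^i(\theta) = F_{\log x}\big(\ldots, \mu_{j,k+1}^i(\theta), \ldots\big)
= \frac{\prod_{j=1}^{n} \mu_{j,k+1}^i(\theta)^{[A_k]_{ij}}}{\sum_{p=1}^{m} \prod_{j=1}^{n} \mu_{j,k+1}^i(\theta_p)^{[A_k]_{ij}}}.
\]

Substitute the expressions for $\mu_{j,k+1}^i(\theta)$ in the preceding
relation to obtain

\[
\mu_{k+1}^i(\theta)
= \frac{\prod_{j=1}^{n} \left( \dfrac{\mu_k^j(\theta) \, \ell^i(s_{k+1}^i|\theta)}{\sum_{q=1}^{m} \mu_k^j(\theta_q) \, \ell^i(s_{k+1}^i|\theta_q)} \right)^{[A_k]_{ij}}}
{\sum_{p=1}^{m} \prod_{j=1}^{n} \left( \dfrac{\mu_k^j(\theta_p) \, \ell^i(s_{k+1}^i|\theta_p)}{\sum_{q=1}^{m} \mu_k^j(\theta_q) \, \ell^i(s_{k+1}^i|\theta_q)} \right)^{[A_k]_{ij}}}
= \frac{\prod_{j=1}^{n} \mu_k^j(\theta)^{[A_k]_{ij}} \, \ell^i(s_{k+1}^i|\theta)}
{\sum_{p=1}^{m} \prod_{j=1}^{n} \mu_k^j(\theta_p)^{[A_k]_{ij}} \, \ell^i(s_{k+1}^i|\theta_p)},
\]

where the last equality is obtained by noting that the term
$\prod_{j=1}^{n} \big( \sum_{q=1}^{m} \mu_k^j(\theta_q) \, \ell^i(s_{k+1}^i|\theta_q) \big)^{[A_k]_{ij}}$ cancels out from the
numerator and the denominator, and that the likelihood factors
out of the product because the rows of $A_k$ sum to one. The last
relation is the same as Eq. (2) with $\beta_k^i = 1$. ■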
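The commutation property in the proof can also be observed numerically. The sketch below compares "pool, then update" against "update, then pool" for the logarithmic pool, and runs the same comparison for the linear pool as a contrast; the beliefs, weights, and likelihood values are assumptions of the sketch.

```python
import math


def bayes(prior, likelihood):
    """Bayes update of one belief with one likelihood vector."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def logarithmic_pool(beliefs, weights):
    """Normalized weighted geometric average of strictly positive beliefs."""
    m = len(beliefs[0])
    unnormalized = [
        math.exp(sum(w * math.log(b[p]) for w, b in zip(weights, beliefs)))
        for p in range(m)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def linear_pool(beliefs, weights):
    """Weighted arithmetic average of beliefs."""
    m = len(beliefs[0])
    return [sum(w * b[p] for w, b in zip(weights, beliefs)) for p in range(m)]


beliefs = [[0.9, 0.1], [0.5, 0.5]]  # neighbors' beliefs
weights = [0.5, 0.5]                # one row of the weight matrix (sums to one)
likelihood = [0.2, 0.7]             # likelihood of the agent's new observation

log_pool_then_update = bayes(logarithmic_pool(beliefs, weights), likelihood)
log_update_then_pool = logarithmic_pool([bayes(b, likelihood) for b in beliefs], weights)

lin_pool_then_update = bayes(linear_pool(beliefs, weights), likelihood)
lin_update_then_pool = linear_pool([bayes(b, likelihood) for b in beliefs], weights)
```

For the logarithmic pool the two orders agree up to rounding, while for the linear pool they differ on this example (0.4 versus roughly 0.471 on the first hypothesis).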

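Consider now a Linear Opinion pool, where

\[
F_x(\mu^1, \ldots, \mu^n)(\theta) = \sum_{j=1}^{n} [A_k]_{ij} \, \mu^j(\theta).
\]

If the opinion aggregation is done first, as studied in [27], then
the resulting update rule is

\[
\mu_{k+1}^i(\theta) =
\frac{\sum_{j=1}^{n} [A_k]_{ij} \, \mu_k^j(\theta) \, \ell^i(s_{k+1}^i|\theta)}
{\sum_{p=1}^{m} \sum_{j=1}^{n} [A_k]_{ij} \, \mu_k^j(\theta_p) \, \ell^i(s_{k+1}^i|\theta_p)}.
\]
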

On the other hand, if the Bayesian update is done first, then
the resulting update rule is

\[
\mu_{k+1}^i(\theta) = \sum_{j=1}^{n} [A_k]_{ij} \,
\frac{\mu_k^j(\theta) \, \ell^i(s_{k+1}^i|\theta)}{\sum_{p=1}^{m} \mu_k^j(\theta_p) \, \ell^i(s_{k+1}^i|\theta_p)}. \tag{4}
\]

authors propose an update rule where every agent performs
local Bayesian updates before aggregating their beliefs using
geometric averages, i.e.

\[
\mu_{k+1}^i(\theta) = F_{\log x}\left( \ldots,
\frac{\mu_k^j(\theta) \, \ell^j(s_{k+1}^j|\theta)}{\sum_{p=1}^{m} \mu_k^j(\theta_p) \, \ell^j(s_{k+1}^j|\theta_p)},
\ldots \right).
\]

Convergence results for fixed communication matrices are
provided, as well as asymptotic characterizations of the rates
of convergence. Later in [28], the authors extended the
characterization of the rate of convergence to large deviation theory,
providing a statement about the existence of a random time
after which the beliefs will decrease exponentially.

IV. Consistency of the Learning Rule

This section provides the proof for Theorem 1. We begin
with a sequence of auxiliary lemmas. First, we recall a few
results from R3ll about the convergence of a product of doubly
stochastic matrices.



The Linear Pool-based update rule is similar to the update 
rule proposed in ifTTl . The authors in ifTTll proposed the 
following rule 


— '’'x 


Sp=i Mfe (^p) 


where opinion aggregation with linear functions is performed 
locally with priors from the neighbors. The main difference is 
that in Eq. a convex combination of the posteriors received 
from the neighbor set is used to generate the new individual 
posterior, while in IfTTll the update rule is a convex combination 
of the individual posterior and the neighbors’ priors. 

In HD, the authors considered the case where the ran¬ 
domized gossip algorithm defines the communication struc¬ 
ture. The update protocol is based on a distributed version 
of the Nesterov’s dual averaging with stochastic gradients 
corresponding to the log-likelihood models given a set of 
observations. In this case, the agents exchange the likelihoods 
of the current observations instead of the beliefs. Thus, the 
consensus step is performed as a geometric aggregation of the 
likelihoods, and the resulting update rule can be described as 





where Wk is the communication matrix coming from the 
gossip protocol. 

The idea of communicating aggregated versions likelihoods 
instead of beliefs was previously studied in the context of 
distributed estimation in sensor networks m. Approaching 
the problem from the point of view of the Belief Propagation 
algorithm, resulted in an update rule in the form of Eq. Q. 
In Il24l . the authors showed convergence results for primi¬ 
tive, rings, tree, random graphs and other extensions to the 
original belief propagation algorithm. Similarly, in ll28ll . the 


Lemma 5 4451/ , 4421/ Under Assumption 1 on a matrix sequence
$\{A_k\}$, we have



1 

n 




y k>t>0 


where $\lambda \in (0,1)$ satisfies the relations described in Theorem 2.


Proof: The proof may be found in [43], with the exception
of the bounds on $\lambda$ for the lazy Metropolis chains which may
be found in m- ■ 

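Next, we present a result regarding the weighted average of
random variables with a finite variance.


Lemma 6 Assume that the graph sequence $\{G_k\}$ satisfies
Assumption 1. Also, let Assumption 3 hold. Then, for $\theta_v \notin \Theta^*$
and $\theta_w \in \Theta^*$,

$$\lim_{k\to\infty}\left(\frac{1}{k}\sum_{t=1}^{k} A_{k:t}\,\mathcal{L}_t^{\theta_v,\theta_w} + \frac{1}{n}\mathbf{1}_n\mathbf{1}_n' H(\theta_v,\theta_w)\right) = 0 \quad \text{a.s.} \qquad (6)$$

where $\mathcal{L}_t^{\theta_v,\theta_w}$ is a random vector with coordinates given by

$$[\mathcal{L}_t^{\theta_v,\theta_w}]_i = \beta^i_{t-1}\log\frac{\ell^i(S^i_t|\theta_v)}{\ell^i(S^i_t|\theta_w)} \quad \forall i = 1,\ldots,n$$

while the vector $H(\theta_v,\theta_w)$ has coordinates given by

$$H^i(\theta_v,\theta_w) = \beta^i\left(D_{KL}\left(f^i\|\ell^i(\cdot|\theta_v)\right) - D_{KL}\left(f^i\|\ell^i(\cdot|\theta_w)\right)\right).$$

Proof: Adding and subtracting $\frac{1}{k}\sum_{t=1}^{k}\frac{1}{n}\mathbf{1}_n\mathbf{1}_n'\mathcal{L}_t^{\theta_v,\theta_w}$ to
the expression under the limit in Eq. (6) yields

$$\frac{1}{k}\sum_{t=1}^{k} A_{k:t}\,\mathcal{L}_t^{\theta_v,\theta_w} + \frac{1}{n}\mathbf{1}_n\mathbf{1}_n' H(\theta_v,\theta_w) = \frac{1}{k}\sum_{t=1}^{k}\left(A_{k:t} - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n'\right)\mathcal{L}_t^{\theta_v,\theta_w} + \frac{1}{k}\sum_{t=1}^{k}\frac{1}{n}\mathbf{1}_n\mathbf{1}_n'\left(\mathcal{L}_t^{\theta_v,\theta_w} + H(\theta_v,\theta_w)\right). \qquad (7)$$

By Lemma 5, $\lim_{k\to\infty} A_{k:t} = \frac{1}{n}\mathbf{1}_n\mathbf{1}_n'$ for all $t \ge 0$. Moreover,
by Assumption 3(b), we have that $\log\alpha \le [\mathcal{L}_t^{\theta_v,\theta_w}]_i \le \log\frac{1}{\alpha}$.
Thus, the first term on the right hand side of Eq. (7) goes to
zero a.s. as we take the limit over $k \to \infty$.

Regarding the second term on the right side of Eq. (7), by
the definition of the KL divergence, and the assumption of
each $\beta^i_t$ being independent, we have that

$$\mathbb{E}\left[\beta^i_{t-1}\log\frac{\ell^i(S^i_t|\theta_v)}{\ell^i(S^i_t|\theta_w)}\right] = \beta^i\sum_{s\in\mathcal{S}^i} f^i(s)\log\left(\frac{\ell^i(s|\theta_v)}{\ell^i(s|\theta_w)}\,\frac{f^i(s)}{f^i(s)}\right)$$

$$= \beta^i\left(\sum_{s\in\mathcal{S}^i} f^i(s)\log\frac{f^i(s)}{\ell^i(s|\theta_w)} - \sum_{s\in\mathcal{S}^i} f^i(s)\log\frac{f^i(s)}{\ell^i(s|\theta_v)}\right)$$

$$= \beta^i\left(D_{KL}\left(f^i\|\ell^i(\cdot|\theta_w)\right) - D_{KL}\left(f^i\|\ell^i(\cdot|\theta_v)\right)\right)$$

or equivalently $\mathbb{E}[\mathcal{L}_t^{\theta_v,\theta_w}] = -H(\theta_v,\theta_w)$.

Kolmogorov's strong law of large numbers states that
if $\{X_t\}$ is a sequence of independent random variables
with variances such that $\sum_{k=1}^{\infty}\frac{\mathrm{Var}(X_k)}{k^2} < \infty$, then
$\frac{1}{n}\sum_{k=1}^{n} X_k - \frac{1}{n}\sum_{k=1}^{n}\mathbb{E}[X_k] \to 0$ a.s. Let $X_t = \mathbf{1}_n'\mathcal{L}_t^{\theta_v,\theta_w}$,
then by Assumption 3(b), it can be seen that
$\sup_{t\ge 0}\mathrm{Var}(X_t) < \infty$. The final result follows by Lemma 5
and Kolmogorov's strong law of large numbers. ■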
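The identity between the expected log-likelihood ratio and a difference of KL divergences, used in the proof above, can be verified on a small discrete example. This is a minimal sketch; the three distributions and the value of `beta` are made-up numbers chosen only for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence D_KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

f = np.array([0.5, 0.3, 0.2])        # hypothetical true distribution of the signals
lik_v = np.array([0.2, 0.2, 0.6])    # hypothetical likelihood model under theta_v
lik_w = np.array([0.45, 0.35, 0.2])  # hypothetical likelihood model under theta_w
beta = 1.0                           # assumed mean of the weights beta_t

# Left-hand side: expectation of beta * log(l(S|theta_v) / l(S|theta_w)) with S ~ f.
expected_log_ratio = beta * float(np.sum(f * np.log(lik_v / lik_w)))
# Right-hand side: beta * (D_KL(f || l_w) - D_KL(f || l_v)).
kl_difference = beta * (kl_divergence(f, lik_w) - kl_divergence(f, lik_v))
```

Here `lik_w` is the model closer to `f`, so the expected log ratio is negative, which is what drives the belief on the worse hypothesis to zero.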

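Lemma 6 provides the necessary results to complete the
proof of Theorem 1.

Proof: (Theorem 1) Initially, let us define the following
quantities: for all $i = 1,\ldots,n$ and $k \ge 0$,

$$\varphi^i_k(\theta_v,\theta_w) = \log\frac{\mu^i_k(\theta_v)}{\mu^i_k(\theta_w)} \qquad (8)$$

defined for any $\theta_v \notin \Theta^*$ and $\theta_w \in \Theta^*$. We also use these
quantities later in the proof of Theorem 2.

Let agent $i$ be arbitrary and consider the update rule of
Eq. (2). We will show that $\mu^i_k(\theta_v) \to 0$ as $k \to \infty$ for all
$i = 1,\ldots,n$. Note that if $\theta_v \in \Theta^* \setminus \hat{\Theta}^*$, then as a consequence
of Assumption 3(a) we have that $\mu^i_k(\theta_v) = 0$ for all $i$ and large
enough $k$. Thus, we consider the case when $\theta_v \notin \Theta^*$ in the
remainder of this proof.

Using the definition of $\varphi^i_k(\theta_v,\theta_w)$, it follows from Eq. (2)
that

$$\varphi^i_{k+1}(\theta_v,\theta_w) = \log\frac{\mu^i_{k+1}(\theta_v)}{\mu^i_{k+1}(\theta_w)} = \sum_{j=1}^{n}[A_k]_{ij}\,\varphi^j_k(\theta_v,\theta_w) + \beta^i_k\log\frac{\ell^i(S^i_{k+1}|\theta_v)}{\ell^i(S^i_{k+1}|\theta_w)}.$$

Stacking up the values $\varphi^i_{k+1}(\theta_v,\theta_w)$ for $i = 1,\ldots,n$, into
a single vector $\varphi_{k+1}(\theta_v,\theta_w)$, we can compactly write the
preceding relations, as follows:

$$\varphi_{k+1}(\theta_v,\theta_w) = A_k\,\varphi_k(\theta_v,\theta_w) + \mathcal{L}_{k+1}^{\theta_v,\theta_w} \qquad (9)$$

where $\mathcal{L}_{k+1}^{\theta_v,\theta_w}$ is defined in the statement of Lemma 6. Now,
the relation in Eq. (9) implies that for all $k \ge 0$,

$$\varphi_{k+1}(\theta_v,\theta_w) = A_{k:0}\,\varphi_0(\theta_v,\theta_w) + \sum_{t=1}^{k} A_{k:t}\,\mathcal{L}_t^{\theta_v,\theta_w} + \mathcal{L}_{k+1}^{\theta_v,\theta_w}. \qquad (10)$$

Then, if we add and subtract $\sum_{t=1}^{k}\frac{1}{n}\mathbf{1}_n\mathbf{1}_n' H(\theta_v,\theta_w)$ in
Eq. (10), where $H(\theta_v,\theta_w)$ is as in Lemma 6, it follows that

$$\varphi_{k+1}(\theta_v,\theta_w) = A_{k:0}\,\varphi_0(\theta_v,\theta_w) - \frac{k}{n}\sum_{i=1}^{n} H^i(\theta_v,\theta_w)\,\mathbf{1}_n + \sum_{t=1}^{k}\left(A_{k:t}\,\mathcal{L}_t^{\theta_v,\theta_w} + \frac{1}{n}\mathbf{1}_n\mathbf{1}_n' H(\theta_v,\theta_w)\right) + \mathcal{L}_{k+1}^{\theta_v,\theta_w}.$$

By the definition of group confidence, we have

$$\sum_{i=1}^{n} H^i(\theta_v,\theta_w) = C_q(\theta_w) - C_q(\theta_v) = C_q^* - C_q(\theta_v) \qquad (11)$$

where the last equality follows from $\theta_w \in \Theta^*$ and the
definition of the optimal value $C_q^*$. Therefore,

$$\varphi_{k+1}(\theta_v,\theta_w) = A_{k:0}\,\varphi_0(\theta_v,\theta_w) - \frac{k}{n}\left(C_q^* - C_q(\theta_v)\right)\mathbf{1}_n + \sum_{t=1}^{k}\left(A_{k:t}\,\mathcal{L}_t^{\theta_v,\theta_w} + \frac{1}{n}\mathbf{1}_n\mathbf{1}_n' H(\theta_v,\theta_w)\right) + \mathcal{L}_{k+1}^{\theta_v,\theta_w}.$$

By dividing both sides of the preceding equation by $k$ and
taking the limit as $k$ goes to infinity, almost surely we have

$$\lim_{k\to\infty}\frac{1}{k}\varphi_{k+1}(\theta_v,\theta_w) = \lim_{k\to\infty}\frac{1}{k}A_{k:0}\,\varphi_0(\theta_v,\theta_w) + \lim_{k\to\infty}\left(\frac{1}{k}\sum_{t=1}^{k} A_{k:t}\,\mathcal{L}_t^{\theta_v,\theta_w} + \frac{1}{n}\mathbf{1}_n\mathbf{1}_n' H(\theta_v,\theta_w)\right) + \lim_{k\to\infty}\frac{1}{k}\mathcal{L}_{k+1}^{\theta_v,\theta_w} - \frac{1}{n}\left(C_q^* - C_q(\theta_v)\right)\mathbf{1}_n. \qquad (12)$$

The limit on the left hand side of Eq. (12) is justified since
all the limits on the right-hand side exist. Specifically, the
first term of the right-hand side of Eq. (12) converges to zero
deterministically. The second term converges to zero almost
surely by Lemma 6, while the third term goes to zero since
$\mathcal{L}_{k+1}^{\theta_v,\theta_w}$ is bounded almost surely (cf. Assumption 3(b)).
Consequently,

$$\lim_{k\to\infty}\frac{1}{k}\varphi_{k+1}(\theta_v,\theta_w) = -\frac{1}{n}\left(C_q^* - C_q(\theta_v)\right)\mathbf{1}_n \quad \text{a.s.}$$

Since $C_q^*$ is the maximum value and $\theta_v \notin \Theta^*$, it follows that
$C_q^* - C_q(\theta_v) > 0$, implying that $\varphi^i_k(\theta_v,\theta_w) \to -\infty$ almost
surely. Also, by $\mu^i_k(\theta_v) \le \exp\left(\varphi^i_k(\theta_v,\theta_w)\right)$ for all $i$, we have
$\mu^i_k(\theta_v) \to 0$ a.s. ■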
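The consistency argument can be illustrated with a small simulation of a log-linear update of the kind analyzed above. This is only a sketch under stated assumptions: a static complete graph with equal weights, $\beta^i_k = 1$, Bernoulli likelihood models, and hypothetical parameter values; it is not the authors' experimental setup.

```python
import numpy as np

def log_linear_step(log_beliefs, A, log_lik):
    """One update in the log domain: geometric averaging of the neighbors'
    beliefs, multiplication by the local likelihood, then normalization."""
    unnormalized = A @ log_beliefs + log_lik
    shift = unnormalized.max(axis=1, keepdims=True)
    log_norm = shift + np.log(np.exp(unnormalized - shift).sum(axis=1, keepdims=True))
    return unnormalized - log_norm

rng = np.random.default_rng(1)
success_prob = np.array([0.2, 0.5, 0.7])  # hypothetical Bernoulli hypotheses
true_prob = 0.7                           # signals are generated by the last hypothesis
n_agents, n_steps = 3, 500
A = np.full((n_agents, n_agents), 1.0 / n_agents)          # doubly stochastic weights
log_beliefs = np.log(np.full((n_agents, 3), 1.0 / 3.0))    # uniform initial beliefs

for _ in range(n_steps):
    signals = rng.random(n_agents) < true_prob             # one binary signal per agent
    lik = np.where(signals[:, None], success_prob, 1.0 - success_prob)
    log_beliefs = log_linear_step(log_beliefs, A, np.log(lik))

beliefs = np.exp(log_beliefs)
```

Working with log-beliefs avoids underflow, since the beliefs on the suboptimal hypotheses decay geometrically.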

One specific instance of our setup is when there exists a
unique hypothesis that matches the distribution of the
observations of all agents. This case relates to the previously proposed
approaches for distributed learning. Specifically, in Ei, m,
El, the authors assume that there is a "true state" of the
world, i.e., there is a unique hypothesis such that the distance
between such hypothesis and the true distribution of the data
is zero for all agents. This case can be expressed as a
consequence of Theorem 1 as follows:

Corollary 7 Under the assumptions of Theorem 1, if there is a
unique hypothesis $\theta^*$ with $C_q^* = 0$, then

$$\lim_{k\to\infty}\mu^i_k(\theta^*) = 1 \quad \text{a.s.} \quad \forall i \in V.$$

Proof: By Theorem 1, for every $\theta \neq \theta^*$ we have that
$\lim_{k\to\infty}\mu^i_k(\theta) = 0$ a.s. ■

In general, one can consider several closed social cliques
where the same hypothesis can represent different distributions
for different groups. For example, in a social network, what
one community might consider a good hypothesis need
not be good for other communities. Each disconnected social
clique could have a different optimal hypothesis, even if all
observations come from the same distribution, see Figure 2. If
such social cliques interact, Theorem 1 provides the conditions
under which all agents will agree on the hypothesis that is the
closest to the best one considering the models of all agents in
the network and not only those in a specific clique.



Fig. 2. Conflicting social groups interacting. Initially on the left, there are
three isolated social cliques, each with a different optimal hypothesis. Once
such groups interact (on the right), others might influence the local decision
and a clique changes its beliefs to the optimal with respect to the complete set
of agents. In this case, one of the groups was convinced that $\theta_1$ was a better
solution than $\theta_2$.

The previous statement is formally stated in the next corollary.


Corollary 8 Let the agent set $V$ be partitioned into $p$ disjoint
sets $V_j$, $j = 1,\ldots,p$. Under the assumptions of Theorem 1, where
each agent updates its beliefs according to Eq. (2), if there
exists a hypothesis $\theta^*$ such that


then $\lim_{k\to\infty}\mu^i_k(\theta^*) = 1$ a.s. for all $i$.


Proof: If the hypothesis $\theta^*$ exists, then the group
confidence on $\theta^*$ is larger than the group confidence for any
other hypothesis. Thus, $\Theta^* = \{\theta^*\}$ and the result follows
by Theorem 1. ■


V. Rate of Convergence for Time-Varying Graphs 

In this section, we prove Theorem 2, which provides an
explicit rate of convergence for the learning process.


The next lemma is an extension of Lemma 2 in [29] to the
case of time-varying graphs. It provides a technical result that
will help us later in the computation of the non-asymptotic
convergence rate.


Lemma 9 Let Assumption 1 hold for a matrix sequence $\{A_k\}$.
Then for all $i$,

$$\sum_{t=1}^{k}\sum_{j=1}^{n}\left|[A_{k:t}]_{ij} - \frac{1}{n}\right| \le \frac{4\log n}{1-\lambda}$$

where $\lambda = 1 - \eta/4n^2$, and if every $A_k$ is a lazy Metropolis
matrix then $\lambda = 1 - 1/O(n^2)$.


Proof: In [29], the authors assume the weight matrix
is static and diagonalizable, then they use the following
inequality from [51]:

$$\left\|e_j' A^t - \pi'\right\|_1 \le n\,\lambda_{\max}(A)^t$$

where $e_j$ is a vector with its $j$-th entry equal to one and zero
otherwise, $\pi$ is the stationary distribution of the Markov chain
with transition matrix $A$ and $\lambda_{\max}(A)$ is the second largest
eigenvalue of the matrix $A$.

For time-varying graphs one can use the inequality in
Lemma 5 instead. The remainder of the proof remains the same
as in [29]. ■
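The bound of the preceding lemma can be explored numerically in the static case, where the product $A_{k:t}$ reduces to a matrix power. The sketch below makes several assumptions that are not in the text: a path graph, lazy Metropolis weights $1/(2\max\{d^i,d^j\})$ (one common convention), and the second largest eigenvalue modulus of the matrix used as a stand-in for $\lambda$. The helper `lazy_metropolis_path` is hypothetical.

```python
import numpy as np

def lazy_metropolis_path(n):
    """Lazy Metropolis weight matrix of a path graph with n >= 3 nodes
    (assumed convention: off-diagonal weight 1 / (2 * max(d_i, d_j)))."""
    degree = np.array([1] + [2] * (n - 2) + [1])
    A = np.zeros((n, n))
    for i in range(n - 1):
        w = 1.0 / (2 * max(degree[i], degree[i + 1]))
        A[i, i + 1] = A[i + 1, i] = w
    A[np.diag_indices(n)] = 1.0 - A.sum(axis=1)   # self-weights make rows sum to one
    return A

n, k = 10, 400
A = lazy_metropolis_path(n)
lam = np.sort(np.abs(np.linalg.eigvalsh(A)))[-2]  # second largest eigenvalue modulus

power = np.eye(n)
deviation = np.zeros(n)        # per-row sum over t and j of |[A^t]_ij - 1/n|
for _ in range(k):
    power = power @ A
    deviation += np.abs(power - 1.0 / n).sum(axis=1)

bound = 4 * np.log(n) / (1 - lam)
```

For this example the accumulated deviation of every row stays below `bound`, in line with the statement of the lemma.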

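Before proving Theorem 2 we will provide an auxiliary
result regarding bounds on the expectation of the random
variables $\varphi^i_k(\theta_v,\theta_w)$ as defined in Eq. (8).


Lemma 10 Consider $\varphi^i_k(\theta_v,\theta_w)$ as defined in Eq. (8), with
$\theta_w \in \Theta^*$. Then, for any $\theta_v \notin \Theta^*$ we have

$$\mathbb{E}\left[\varphi^i_{k+1}(\theta_v,\theta_w)\right] \le \gamma_1^i - k\gamma_2 \quad \text{for all } i \text{ and } k \ge 0$$

with $\gamma_1^i$ and $\gamma_2$ as defined in Theorem 2.

Proof: Taking the expected value in Eq. (10) we can see
that for all $k \ge 0$,

$$\mathbb{E}\left[\varphi^i_{k+1}(\theta_v,\theta_w)\right] = \sum_{j=1}^{n}[A_{k:0}]_{ij}\,\varphi^j_0(\theta_v,\theta_w) - \sum_{t=1}^{k}\sum_{j=1}^{n}[A_{k:t}]_{ij}\,H^j(\theta_v,\theta_w) - H^i(\theta_v,\theta_w).$$

By adding and subtracting $\sum_{t=1}^{k+1}\sum_{j=1}^{n}\frac{1}{n}H^j(\theta_v,\theta_w)$, we obtain

$$\mathbb{E}\left[\varphi^i_{k+1}(\theta_v,\theta_w)\right] = \sum_{j=1}^{n}[A_{k:0}]_{ij}\,\varphi^j_0(\theta_v,\theta_w) - \sum_{t=1}^{k}\sum_{j=1}^{n}\left([A_{k:t}]_{ij} - \frac{1}{n}\right)H^j(\theta_v,\theta_w) - \frac{k+1}{n}\sum_{j=1}^{n}H^j(\theta_v,\theta_w) + \frac{1}{n}\sum_{j=1}^{n}\left(H^j(\theta_v,\theta_w) - H^i(\theta_v,\theta_w)\right). \qquad (13)$$

For the first term in Eq. (13), since $A_{k:0}$ is a stochastic matrix,
we have that

$$\sum_{j=1}^{n}[A_{k:0}]_{ij}\,\varphi^j_0(\theta_v,\theta_w) \le \max_{i}\log\frac{\mu^i_0(\theta_v)}{\mu^i_0(\theta_w)}.$$

The second term in Eq. (13) can be bounded using
Lemma 9, thus

$$-\sum_{t=1}^{k}\sum_{j=1}^{n}\left([A_{k:t}]_{ij} - \frac{1}{n}\right)H^j(\theta_v,\theta_w) \le \frac{4\log n}{1-\lambda}\log\frac{1}{\alpha}$$

since $\log\alpha \le H^j(\theta_v,\theta_w) \le \log\frac{1}{\alpha}$.

The last term in Eq. (13) is bounded as

$$\frac{1}{n}\sum_{j=1}^{n}\left(H^j(\theta_v,\theta_w) - H^i(\theta_v,\theta_w)\right) \le 2\log\frac{1}{\alpha} \le \frac{8\log n}{1-\lambda}\log\frac{1}{\alpha}$$

where the last inequality follows from $2 \le 8\log n$ for $n \ge 2$
and $1 - \lambda \le 1$.

Finally we have that

$$\mathbb{E}\left[\varphi^i_{k+1}(\theta_v,\theta_w)\right] \le \max_{i}\log\frac{\mu^i_0(\theta_v)}{\mu^i_0(\theta_w)} + \frac{12\log n}{1-\lambda}\log\frac{1}{\alpha} - \frac{k+1}{n}\sum_{j=1}^{n}H^j(\theta_v,\theta_w)$$

from which the desired result follows by using the definitions
of $\gamma_1^i$, $\gamma_2$, $H^j(\theta_v,\theta_w)$ and taking the appropriate maximum
values over $\theta_v$ and $\theta_w$ on the right hand side of the preceding
inequality. ■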

In the proof of Theorem 2 we will use McDiarmid's
inequality [52], which provides bounds for the concentration
of functions of random variables. This inequality allows us to
show bounds on the probability that the beliefs exceed a given
value $\epsilon$. For completeness, we next state McDiarmid's
inequality.


Theorem 11 (McDiarmid's inequality [52]) Let $X_1,\ldots,X_k$
be a sequence of independent random variables with $X_t \in \mathcal{X}$
for $1 \le t \le k$. Further, let $g : \mathcal{X}^k \to \mathbb{R}$ be a function of
bounded differences, i.e., for all $1 \le t \le k$,

$$\sup_{x_t\in\mathcal{X}} g(\ldots,x_t,\ldots) - \inf_{x_t\in\mathcal{X}} g(\ldots,x_t,\ldots) \le c_t$$

then for any $\epsilon > 0$ and all $k \ge 1$,

$$\mathbb{P}\left(g\left(\{X_t\}_{t=1}^{k}\right) - \mathbb{E}\left[g\left(\{X_t\}_{t=1}^{k}\right)\right] \ge \epsilon\right) \le \exp\left(-\frac{2\epsilon^2}{\sum_{t=1}^{k}c_t^2}\right).$$
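A quick Monte Carlo experiment gives a feel for how McDiarmid's inequality is used. The sketch below assumes a deliberately simple function, the sample mean of $k$ independent uniform variables on $[0,1]$, whose bounded differences are $c_t = 1/k$; the values of `k`, `eps` and `trials` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
k, eps, trials = 50, 0.1, 20000

# g(X_1, ..., X_k) is the sample mean; changing one coordinate moves g by at most 1/k.
samples = rng.random((trials, k))
g_values = samples.mean(axis=1)
expected_g = 0.5

# Empirical frequency of the deviation event {g - E[g] >= eps}.
empirical_tail = float(np.mean(g_values - expected_g >= eps))

# McDiarmid bound exp(-2 eps^2 / sum_t c_t^2) with c_t = 1/k.
c = np.full(k, 1.0 / k)
mcdiarmid_bound = float(np.exp(-2 * eps**2 / np.sum(c**2)))
```

The empirical tail frequency is far below the bound here, which is expected: the inequality only uses the bounded-difference constants and not the actual variance.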




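Now, we are ready to prove Theorem 2.

Proof: (Theorem 2) First, we will express the belief
$\mu^i_{k+1}(\theta_v)$ in terms of the variable $\varphi^i_{k+1}(\theta_v,\theta_w)$. This will
allow us to use McDiarmid's inequality to obtain the
concentration bounds. By the dynamics of the beliefs in Eq. (2)
and Assumption 3(a), since $\mu^i_{k+1}(\theta_w) \in (0,1]$ for $\theta_w \in \Theta^*$, we
have

$$\mu^i_{k+1}(\theta_v) \le \frac{\mu^i_{k+1}(\theta_v)}{\mu^i_{k+1}(\theta_w)} = \exp\left(\varphi^i_{k+1}(\theta_v,\theta_w)\right).$$

Therefore,

$$\mathbb{P}\left(\mu^i_{k+1}(\theta_v) \ge \exp\left(-\frac{k}{2}\gamma_2 + \gamma_1^i\right)\right) \le \mathbb{P}\left(\exp\left(\varphi^i_{k+1}(\theta_v,\theta_w)\right) \ge \exp\left(-\frac{k}{2}\gamma_2 + \gamma_1^i\right)\right)$$

$$= \mathbb{P}\left(\varphi^i_{k+1}(\theta_v,\theta_w) \ge -\frac{k}{2}\gamma_2 + \gamma_1^i\right)$$

$$\le \mathbb{P}\left(\varphi^i_{k+1}(\theta_v,\theta_w) - \mathbb{E}\left[\varphi^i_{k+1}(\theta_v,\theta_w)\right] \ge \frac{k}{2}\gamma_2\right)$$

where the last inequality follows from Lemma 10.

We now view $\varphi^i_{k+1}(\theta_v,\theta_w)$ as a function of the random
vectors $S_1,\ldots,S_k$ (see Eq. (10)), where $S_t = (S^1_t,\ldots,S^n_t)$
for $t \ge 1$, and the random variable $S^i_{k+1}$. Next, we will
establish that this function has bounded differences in order
to apply McDiarmid's inequality.

For all $t$ with $1 \le t \le k$ and $j$ with $1 \le j \le n$, we have

$$\max_{s^j_t\in\mathcal{S}^j}\varphi^i_{k+1}(\theta_v,\theta_w) - \min_{s^j_t\in\mathcal{S}^j}\varphi^i_{k+1}(\theta_v,\theta_w) = \max_{s^j_t\in\mathcal{S}^j}[A_{k:t}]_{ij}\log\frac{\ell^j(s^j_t|\theta_v)}{\ell^j(s^j_t|\theta_w)} - \min_{s^j_t\in\mathcal{S}^j}[A_{k:t}]_{ij}\log\frac{\ell^j(s^j_t|\theta_v)}{\ell^j(s^j_t|\theta_w)}$$

$$\le [A_{k:t}]_{ij}\log\frac{1}{\alpha} + [A_{k:t}]_{ij}\log\frac{1}{\alpha} = 2[A_{k:t}]_{ij}\log\frac{1}{\alpha}.$$

Similarly, from Eq. (10) we can see that

$$\max_{s^i_{k+1}\in\mathcal{S}^i}\varphi^i_{k+1}(\theta_v,\theta_w) - \min_{s^i_{k+1}\in\mathcal{S}^i}\varphi^i_{k+1}(\theta_v,\theta_w) \le 2\log\frac{1}{\alpha}.$$

It follows that $\varphi^i_{k+1}(\theta_v,\theta_w)$ has bounded variations, with

$$\sum_{t=1}^{k}\sum_{j=1}^{n}\left(2[A_{k:t}]_{ij}\log\frac{1}{\alpha}\right)^2 + \left(2\log\frac{1}{\alpha}\right)^2 = 4\left(\log\frac{1}{\alpha}\right)^2\left(\sum_{t=1}^{k}\sum_{j=1}^{n}\left([A_{k:t}]_{ij}\right)^2 + 1\right) \le 4\left(\log\frac{1}{\alpha}\right)^2(k+1)$$

where the last inequality follows from the fact that $A_{k:t}$ is row
stochastic.

Thus,

$$\mathbb{P}\left(\varphi^i_{k+1}(\theta_v,\theta_w) - \mathbb{E}\left[\varphi^i_{k+1}(\theta_v,\theta_w)\right] \ge \frac{k}{2}\gamma_2\right) \le \exp\left(-\frac{2\left(\frac{1}{2}k\gamma_2\right)^2}{4k\left(\log\frac{1}{\alpha}\right)^2}\right).$$

Therefore, for a given confidence level $\rho$, in order to have
$\mathbb{P}\left(\mu^i_{k+1}(\theta_v) \ge \exp\left(-\frac{1}{2}k\gamma_2 + \gamma_1^i\right)\right) \le \rho$ we require that

$$k \ge \frac{1}{\gamma_2^2}\,8\left(\log\alpha\right)^2\log\frac{1}{\rho}.$$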
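The final requirement on $k$ is easy to evaluate for concrete parameters. The helper below is a hypothetical convenience function, not part of the paper: it simply evaluates $\lceil c\,(\log\alpha)^2\log(1/\rho)/\gamma_2^2\rceil$, with the constant $c = 8$ taken from the bound above, and the numerical inputs in the example are made up.

```python
import math

def iterations_required(alpha, gamma2, rho, constant=8.0):
    """Smallest integer k with k >= constant * log(alpha)^2 * log(1/rho) / gamma2^2.

    alpha:  lower bound on the likelihood values, in (0, 1)
    gamma2: positive rate constant
    rho:    confidence level, in (0, 1)
    """
    if not (0.0 < alpha < 1.0 and gamma2 > 0.0 and 0.0 < rho < 1.0):
        raise ValueError("expected 0 < alpha < 1, gamma2 > 0 and 0 < rho < 1")
    return math.ceil(constant * math.log(alpha) ** 2 * math.log(1.0 / rho) / gamma2 ** 2)

# Hypothetical values: alpha = 0.1, gamma2 = 0.5, rho = 0.01.
k_needed = iterations_required(alpha=0.1, gamma2=0.5, rho=0.01)
```

The dependence on $\rho$ is only logarithmic, while halving $\gamma_2$ multiplies the required number of iterations by four.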




VI. Accelerated Learning for Fixed Undirected
Graphs

In this section, we analyze the distributed learning algorithm
of Eq. (3) and prove its non-asymptotic convergence rate.
First, we will state an enabling theorem presented in [50],
which describes a distributed consensus protocol whose
convergence time grows linearly with the number of agents.

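Theorem 12 [50] Suppose each node $i$ in a fixed undirected
connected graph updates its variable $x^i_k$ at each time instant
$k \ge 1$ as follows:

$$y^i_{k+1} = x^i_k + \sum_{j\in N_i}\frac{x^j_k - x^i_k}{2\max\{d^i + 1, d^j + 1\}} \qquad (14a)$$

$$x^i_{k+1} = y^i_{k+1} + \left(1 - \frac{2}{9U+1}\right)\left(y^i_{k+1} - y^i_k\right) \qquad (14b)$$

where $N_i$ is the set of neighbors of agent $i$ and $d^i$ is its
corresponding degree. Then, if $U \ge n$ we have that

$$\left\|y_k - \bar{x}\mathbf{1}_n\right\|_2^2 \le 2\left(1 - \frac{1}{9U}\right)^{k-1}\left\|y_1 - \bar{x}\mathbf{1}_n\right\|_2^2 \quad \forall k \ge 1 \qquad (15)$$

where $[y_k]_i = y^i_k$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x^i_1$, and the process is
initialized with $y^i_1 = x^i_1$.

Next, we define some quantities that we use in the analysis
of Eq. (3). Define the matrix $B$ and a scalar $\sigma$, as follows:

$$B = \begin{bmatrix}(1+\sigma)A & -\sigma A\\ I_n & 0\end{bmatrix} \qquad (16)$$

$$\sigma = 1 - \frac{2}{9U+1} \qquad (17)$$

where $I_n$ is the identity matrix, $0$ is the matrix with all
entries equal to zero of the appropriate size and $A$ is as defined
in Assumption 2.

We have the following auxiliary result for the matrix $B$.

Lemma 13 Consider the matrix $B$ and the parameter $\sigma$ as
defined in Eqs. (16) and (17) respectively. Then

$$\left|\left[[I_n\ 0]B^k[I_n\ I_n]'\right]_{ij} - \frac{1}{n}\right| \le \sqrt{2}\,\lambda^k \quad \forall k \ge 1$$

where $\lambda = 1 - \frac{1}{18U}$.

Proof: The linear time consensus algorithm described in
Eq. (14) can be expressed as

$$y_{k+1} = A x_k$$

$$x_{k+1} = y_{k+1} + \sigma\left(y_{k+1} - y_k\right)$$

which implies that $y_{k+1} = A\left(y_k + \sigma\left(y_k - y_{k-1}\right)\right)$ with
$y_1 = x_1$. Therefore

$$\begin{bmatrix}y_{k+1}\\ y_k\end{bmatrix} = \begin{bmatrix}(1+\sigma)A & -\sigma A\\ I_n & 0\end{bmatrix}\begin{bmatrix}y_k\\ y_{k-1}\end{bmatrix} = B^k\begin{bmatrix}y_1\\ y_0\end{bmatrix}$$

where we assumed that $y_0 = y_1$. Thus,

$$y_{k+1} = [I_n\ 0]B^k[I_n\ I_n]'y_1.$$

By substituting the previous relation into Eq. (15) and using
$x_1 = y_1$, we obtain

$$\left\|[I_n\ 0]B^k[I_n\ I_n]'y_1 - \left(\frac{1}{n}\sum_{i=1}^{n}y^i_1\right)\mathbf{1}_n\right\|_2^2 \le 2\left(1 - \frac{1}{9U}\right)^{k}\left\|y_1 - \left(\frac{1}{n}\sum_{i=1}^{n}y^i_1\right)\mathbf{1}_n\right\|_2^2. \qquad (18)$$

The preceding relation holds for any $y_1$. In particular, if we
take $y_1 = e_j$, where $e_j$ is a vector whose $j$-th entry is equal
to one and zero otherwise, we conclude that for every $i$ and
$j$,

$$\left|\left[[I_n\ 0]B^k[I_n\ I_n]'\right]_{ij} - \frac{1}{n}\right| \le \left\|[I_n\ 0]B^k[I_n\ I_n]'e_j - \frac{1}{n}\mathbf{1}_n\right\|_2 \le \sqrt{2}\left(1 - \frac{1}{18U}\right)^{k}.$$

This follows from the inequality $\sqrt{1-\beta} \le 1 - \beta/2$ for all
$\beta \in (0,1)$ and the fact that $\left\|e_j - \frac{1}{n}\mathbf{1}_n\right\| \le 1$. ■

Now, we are ready to prove Theorem 3.

Proof: (Theorem 3) The proof is along the lines of the
proof for Theorem 2. From the definition of $\varphi^i_{k+1}(\theta_v,\theta_w)$ we
have

$$\varphi^i_{k+1}(\theta_v,\theta_w) = \log\frac{\mu^i_{k+1}(\theta_v)}{\mu^i_{k+1}(\theta_w)}$$

$$= \sum_{j=1}^{n}(1+\sigma)A_{ij}\log\frac{\mu^j_k(\theta_v)}{\mu^j_k(\theta_w)} - \sum_{j=1}^{n}\sigma A_{ij}\log\frac{\mu^j_{k-1}(\theta_v)}{\mu^j_{k-1}(\theta_w)} + \beta^i_k\log\frac{\ell^i(S^i_{k+1}|\theta_v)}{\ell^i(S^i_{k+1}|\theta_w)} - \sum_{j=1}^{n}\sigma A_{ij}\,\beta^j_{k-1}\log\frac{\ell^j(S^j_k|\theta_v)}{\ell^j(S^j_k|\theta_w)}$$

$$= \sum_{j=1}^{n}(1+\sigma)A_{ij}\,\varphi^j_k(\theta_v,\theta_w) - \sum_{j=1}^{n}\sigma A_{ij}\,\varphi^j_{k-1}(\theta_v,\theta_w) + [\mathcal{L}_{k+1}^{\theta_v,\theta_w}]_i - \sum_{j=1}^{n}\sigma A_{ij}[\mathcal{L}_k^{\theta_v,\theta_w}]_j.$$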

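Stacking the previous relation for all $i$ we obtain the following
vector representation for the dynamics

$$\varphi_{k+1}(\theta_v,\theta_w) = (1+\sigma)A\,\varphi_k(\theta_v,\theta_w) - \sigma A\,\varphi_{k-1}(\theta_v,\theta_w) + \mathcal{L}_{k+1}^{\theta_v,\theta_w} - \sigma A\,\mathcal{L}_k^{\theta_v,\theta_w}. \qquad (19)$$

Now, define the following auxiliary vector

$$z_{k+1}(\theta_v,\theta_w) = \varphi_k(\theta_v,\theta_w) + \mathcal{L}_{k+1}^{\theta_v,\theta_w}$$

where $z_0(\theta_v,\theta_w) = 0$, since $\varphi_{-1}(\theta_v,\theta_w) = 0$ by the
assumption of uniform initial beliefs, and $\mathcal{L}_0^{\theta_v,\theta_w} = 0$ due
to $\beta_{-1} = 0$, in which case we can set $S^i_0$ to any value in $\mathcal{S}^i$.
By writing the evolution for the augmented state
$\left(\varphi_{k+1}(\theta_v,\theta_w),\ z_{k+1}(\theta_v,\theta_w)\right)$ we have

$$\begin{bmatrix}\varphi_{k+1}(\theta_v,\theta_w)\\ z_{k+1}(\theta_v,\theta_w)\end{bmatrix} = B\begin{bmatrix}\varphi_k(\theta_v,\theta_w)\\ z_k(\theta_v,\theta_w)\end{bmatrix} + \begin{bmatrix}\mathcal{L}_{k+1}^{\theta_v,\theta_w}\\ \mathcal{L}_{k+1}^{\theta_v,\theta_w}\end{bmatrix}$$

which implies that for all $k \ge 1$,

$$\begin{bmatrix}\varphi_k(\theta_v,\theta_w)\\ z_k(\theta_v,\theta_w)\end{bmatrix} = B^k\begin{bmatrix}\varphi_0(\theta_v,\theta_w)\\ z_0(\theta_v,\theta_w)\end{bmatrix} + \sum_{t=1}^{k}B^{k-t}\begin{bmatrix}\mathcal{L}_t^{\theta_v,\theta_w}\\ \mathcal{L}_t^{\theta_v,\theta_w}\end{bmatrix}.$$

Then we have

$$\varphi_k(\theta_v,\theta_w) = [I_n\ 0]B^k[I_n\ I_n]'\varphi_0(\theta_v,\theta_w) + \sum_{t=1}^{k}[I_n\ 0]B^{k-t}[I_n\ I_n]'\mathcal{L}_t^{\theta_v,\theta_w}$$

where the assumption of uniform initial beliefs sets the first
term of the above relation to zero.

The remainder of the proof follows the structure of the
proof of Theorem 2, where we invoke Lemma 13 instead
of Lemma 5. First, we will find a bound for the expected
value of $\varphi^i_k(\theta_v,\theta_w)$ and later we will show this is of bounded
variations. In this case, we have

$$\mathbb{E}\left[\varphi^i_k(\theta_v,\theta_w)\right] = -\sum_{t=1}^{k}\sum_{j=1}^{n}\left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij}H^j(\theta_v,\theta_w).$$

By adding and subtracting $\sum_{t=1}^{k}\sum_{j=1}^{n}\frac{1}{n}H^j(\theta_v,\theta_w)$ we obtain

$$\mathbb{E}\left[\varphi^i_k(\theta_v,\theta_w)\right] = -\frac{k}{n}\sum_{j=1}^{n}H^j(\theta_v,\theta_w) + \sum_{t=1}^{k}\sum_{j=1}^{n}\left(\frac{1}{n} - \left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij}\right)H^j(\theta_v,\theta_w).$$

Similarly, as in the proof of Theorem 2, we bound the term
in parentheses using the non-asymptotic bounds from Lemma
13 in conjunction with Lemma 9. By doing so, it can be seen
that

$$\mathbb{E}\left[\varphi^i_k(\theta_v,\theta_w)\right] \le \frac{4\log n}{1-\lambda}\log\frac{1}{\alpha} - \frac{k}{n}\sum_{j=1}^{n}H^j(\theta_v,\theta_w).$$

Now, we will show that $\varphi^i_k(\theta_v,\theta_w)$, as a function of the
random variables $S^j_t$ for $1 \le t \le k$ and $1 \le j \le n$,
has bounded variations and we will compute the bound. First,
we fix all other input random variables but $[\mathcal{L}_t^{\theta_v,\theta_w}]_j$ and we
have

$$\max_{s^j_t\in\mathcal{S}^j}\varphi^i_k(\theta_v,\theta_w) - \min_{s^j_t\in\mathcal{S}^j}\varphi^i_k(\theta_v,\theta_w) = \max_{s^j_t\in\mathcal{S}^j}\left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij}[\mathcal{L}_t^{\theta_v,\theta_w}]_j - \min_{s^j_t\in\mathcal{S}^j}\left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij}[\mathcal{L}_t^{\theta_v,\theta_w}]_j$$

$$\le \left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij}\,2\log\frac{1}{\alpha}.$$

Thus, the summation of the squared bounds in McDiarmid's
inequality is

$$\sum_{t=1}^{k}\sum_{j=1}^{n}\left(\left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij}\,2\log\frac{1}{\alpha}\right)^2.$$

Now, by adding and subtracting the term $1/n$ we have that

$$\sum_{t=1}^{k}\sum_{j=1}^{n}\left(\left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij}\,2\log\frac{1}{\alpha}\right)^2 \le 8\left(\log\frac{1}{\alpha}\right)^2\left(\sum_{t=1}^{k}\sum_{j=1}^{n}\left(\left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij} - \frac{1}{n}\right)^2 + \sum_{t=1}^{k}\sum_{j=1}^{n}\frac{1}{n^2}\right)$$

where we have used $x^2 \le 2\left((x-y)^2 + y^2\right)$.

We can bound the first term in the preceding relation using
Eq. (18) with $y_1 = e_j$ since Eq. (18) holds for any choice of
$y_1$. Specifically, we obtain that for all $j = 1,\ldots,n$

$$\sum_{i=1}^{n}\left(\left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij} - \frac{1}{n}\right)^2 \le 2\left(1 - \frac{1}{9U}\right)^{k-t}.$$

Additionally, note that $[I_n\ 0]B^{k}[I_n\ I_n]'$ is a symmetric matrix
for every $k$, since it is a polynomial of $A$ which is symmetric itself. This
in turn implies that $[I_n\ 0]B^{k-t}[I_n\ I_n]'$ is also symmetric.

Therefore, it holds that for all $i = 1,\ldots,n$

$$\sum_{j=1}^{n}\left(\left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij} - \frac{1}{n}\right)^2 \le 2\left(1 - \frac{1}{9U}\right)^{k-t}.$$

Finally, we have

$$\sum_{t=1}^{k}\sum_{j=1}^{n}\left(\left[[I_n\ 0]B^{k-t}[I_n\ I_n]'\right]_{ij}\,2\log\frac{1}{\alpha}\right)^2 \le 8\left(\log\frac{1}{\alpha}\right)^2\left(\sum_{t=1}^{k}2\left(1 - \frac{1}{9U}\right)^{k-t} + \frac{k}{n}\right) \le 24\left(\log\alpha\right)^2 k.$$

Now, by the McDiarmid inequality and getting the values
of $k$ such that the desired probabilistic tolerance level $\rho$ is
achieved, we obtain

$$\mathbb{P}\left(\varphi^i_k(\theta_v,\theta_w) - \mathbb{E}\left[\varphi^i_k(\theta_v,\theta_w)\right] \ge \frac{1}{2}k\gamma_2\right) \le \exp\left(-\frac{2\left(\frac{1}{2}k\gamma_2\right)^2}{24\left(\log\alpha\right)^2 k}\right) = \exp\left(-\frac{k\gamma_2^2}{48\left(\log\alpha\right)^2}\right).$$

Therefore, for a given confidence level $\rho$, in order to have
$\mathbb{P}\left(\mu^i_k(\theta_v) \ge \exp\left(-\frac{1}{2}k\gamma_2 + \gamma_1^i\right)\right) \le \rho$ we require that

$$k \ge \frac{1}{\gamma_2^2}\,48\left(\log\alpha\right)^2\log\frac{1}{\rho}.$$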
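The accelerated consensus iteration that underlies this section can be tried out directly. The following sketch is written under assumptions that are not in the text: a path graph, lazy Metropolis weights $1/(2\max\{d^i,d^j\})$ (one common convention, which may differ from the exact weights used above), $U = n$, and a random initial vector. It runs $y_{k+1} = Ax_k$, $x_{k+1} = y_{k+1} + \sigma(y_{k+1} - y_k)$ with $\sigma = 1 - 2/(9U+1)$ and compares the squared error with the geometric envelope $2(1 - 1/(9U))^{k-1}$.

```python
import numpy as np

def lazy_metropolis_path(n):
    """Lazy Metropolis weight matrix of a path graph with n >= 3 nodes
    (assumed convention: off-diagonal weight 1 / (2 * max(d_i, d_j)))."""
    degree = np.array([1] + [2] * (n - 2) + [1])
    A = np.zeros((n, n))
    for i in range(n - 1):
        w = 1.0 / (2 * max(degree[i], degree[i + 1]))
        A[i, i + 1] = A[i + 1, i] = w
    A[np.diag_indices(n)] = 1.0 - A.sum(axis=1)
    return A

n = 10
U = n                                   # any upper bound U >= n on the number of nodes
sigma = 1.0 - 2.0 / (9 * U + 1)
A = lazy_metropolis_path(n)

rng = np.random.default_rng(3)
x = rng.normal(size=n)                  # x_1
y = x.copy()                            # initialization y_1 = x_1
mean = x.mean()
initial_error = float(np.sum((y - mean) ** 2))

bound_holds = True
for k in range(1, 300):
    error = float(np.sum((y - mean) ** 2))              # squared error of y_k
    envelope = 2 * (1 - 1 / (9 * U)) ** (k - 1) * initial_error
    bound_holds = bound_holds and (error <= envelope + 1e-12)
    y_next = A @ x                                       # averaging step
    x = y_next + sigma * (y_next - y)                    # extrapolation step
    y = y_next

final_error = float(np.sum((y - mean) ** 2))
```

The extrapolation step is what distinguishes this protocol from plain averaging with $A$; setting `sigma = 0` recovers the standard lazy Metropolis iteration.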





Next, we will present simulation results that show how the
convergence time depends on the number of agents in the
network. Figure 3 shows the time required for a group of
agents to have a set of beliefs at a distance of $\epsilon = 0.01$ from
the singleton distribution around the optimal hypothesis. For
example, on a path graph, as the path grows longer, the number
of iterations required to meet the desired $\epsilon$ accuracy grows
rapidly. This is due to the low connectivity of the network.
The time required for consensus is smaller for the circle and
the grid graphs due to their better connectivity properties.

[Figure 3: panels (a) Path Graph, (b) Circle Graph, (c) Grid Graph; horizontal axis: number of nodes]

Fig. 3. Empirical mean over 50 Monte Carlo runs of the number of iterations
required for $\mu^i_k(\theta) < \epsilon$ for all agents on $\theta \notin \Theta^*$. All agents but one have all
their hypotheses to be observationally equivalent. Dotted line for the algorithm
proposed in (m, dashed line for the procedure described in Eq. (2) and solid
line for the procedure described in Eq. (3).

VII. Numerical Example: Distributed Source
Localization

In this section we apply the proposed algorithms to the
problem of distributed source localization based on differential
signal amplitudes [53], [54], [55], [56], [57]. We compare the
performance of our methods, Eq. (2) and Eq. (3), with the
algorithms proposed in [11], [27]. For simulation purposes we
will assume the graphs are fixed and there exists a single $\theta^*$
such that $f^i = \ell^i(\cdot|\theta^*)$ for all $i$, in which case our update rule
simplifies to the learning algorithm proposed in [29].

Assume a group of $n$ agents is randomly distributed in an
area and each agent receives a noisy signal proportional to
its distance to a target. The group objective is to collectively
find the location of the target. Each agent constructs a grid of
hypotheses about the possible location of the source. Figure
4(a) shows a 10 by 10 area partitioned in a 3 by 3 grid,
which results in 9 hypotheses. Moreover, there are three
agents (represented by circles), at different locations. The
graph structure shows that agent 1 communicates with agent
2; similarly, agent 3 communicates with agent 2. The star represents
the target.

Each agent constructs likelihood functions for its hypotheses
based on its sensor model. The observations follow a truncated
normal distribution with the mean proportional to the distance
between the agent and the grid point of the corresponding
hypothesis. For example, assume an agent $i$ is in a position
$p_a = (x_a, y_a)$ and the target is located at $p_s = (x_s, y_s)$. The
received signals are $S^i_k = \|p_s - p_a\| + cW^i_k$, where $c$ is some
positive constant and $W^i_k$ is a truncated zero mean Gaussian
noise. Now, consider that a hypothesis $\theta$ is at a point $p_\theta =
(x_\theta, y_\theta)$. The corresponding likelihood model under hypothesis
$\theta$ assumes observations are $S^i_k|\theta = \|p_\theta - p_a\| + cW^i_k$.

Figure 4(b) shows the likelihood functions for two hypotheses of
agent 2; one of them is clearly closer to the true distribution
of the observations $f^2$. Note that there is not a "true state
of the world" in the sense that $f^i$ is not equal to any of the
hypotheses in the grid.

[Figure 4: panel (a) Network of 3 agents; panel (b) likelihood functions]

Fig. 4. (a) Group of 3 agents in a grid of 3 × 3 hypotheses. Each hypothesis
corresponds to a possible location of the source. For example, hypothesis $\theta_2$
locates the source at the (−10, 0) point in the plane. (b) Likelihood functions
for $\theta_2$ and a second hypothesis, and distribution of observations for agent 2.

The information each agent obtains is enough just to
estimate the distance to the source, but not its complete
coordinates. For instance, a single sensor can only locate the
source within a circular band around it, see Figure 5.

Fig. 5. Belief distribution of one agent over the hypotheses grid. Darker
shades of gray indicate higher beliefs on the corresponding hypothesis.

Figure 6(a) shows another group of 20 agents now
interacting according to an appropriate network structure, see
Assumptions 1 and 2. A finer grid partition has been used,
where each coordinate has 100 points, resulting in 10000
hypotheses in total. Figure 6(b) shows the belief on the
hypothesis $\theta^*$, defined to be the grid point closest to the location
of the target.

Figure 7 repeats the simulations presented in Figure 6 but
including 10 agents with all their hypotheses observationally
equivalent (i.e. no measurements available), and 3 conflicting
agents whose observations have been modified (corrupted)
such that the optimal hypothesis is the (0, 0) point in the grid.

Figure 7(b) shows that the protocols presented in Eqs. (3) and
(2) concentrate the beliefs onto the optimal hypothesis. The
performance of the algorithms in [11] and [27] deteriorates if
conflicting agents are present. This is evident from the lack of
concentration of the beliefs around the true hypotheses.

[Figure 6: panel (a) Network of Agents; panel (b) Belief of one agent on the optimal hypothesis]

Fig. 6. (a) Network of agents as well as the belief distribution over the
hypothesis set (a grid in the x, y location). Darker shades of gray indicate
higher beliefs on the corresponding hypothesis (point in the hypotheses grid).
(b) Belief evolution on the optimal hypothesis $\theta^*$ for different belief update
protocols.
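The distance-based sensor model of the source localization example can be sketched in a few lines. The code below is only an illustration under simplifying assumptions that differ from the text: the noise is treated as an untruncated Gaussian (so the log-likelihood is a quadratic, up to an additive constant), the grid covers $[-10, 10]^2$ with 3 points per axis, and the agent and target positions are made up. The helpers `grid_hypotheses` and `log_likelihoods` are hypothetical names.

```python
import numpy as np

def grid_hypotheses(half_width, points_per_axis):
    """Grid points (candidate source locations) over [-half_width, half_width]^2."""
    axis = np.linspace(-half_width, half_width, points_per_axis)
    xs, ys = np.meshgrid(axis, axis)
    return np.column_stack([xs.ravel(), ys.ravel()])

def log_likelihoods(signal, agent_position, hypotheses, c=1.0):
    """Gaussian log-likelihood (up to a constant) of a distance measurement
    under each hypothesis: the predicted mean is the agent-to-grid-point distance."""
    predicted = np.linalg.norm(hypotheses - agent_position, axis=1)
    return -0.5 * ((signal - predicted) / c) ** 2

hypotheses = grid_hypotheses(10.0, 3)   # 9 hypotheses
agent = np.array([3.0, 4.0])            # hypothetical agent position
target = np.array([10.0, 0.0])          # hypothetical target, placed on a grid point

signal = np.linalg.norm(target - agent)             # noise-free measurement
scores = log_likelihoods(signal, agent, hypotheses)
best = hypotheses[np.argmax(scores)]
```

With a single agent the likelihood only constrains the distance to the source, so in general several grid points on the same circular band can score equally well; the agent position above was chosen so that the maximizer is unique.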
VIII. Conclusions and Future Work

We proposed two distributed cooperative learning
algorithms for the problem of collaborative inference. The first
algorithm focuses on general time-varying undirected graphs,
and the second algorithm is specialized for fixed graphs.
In both cases, we show that the beliefs converge to the
hypothesis set that best describes the observations in the
network. We require reasonable connectivity assumptions on
the communication network over which the agents exchange
information.

Our results prove convergence rates that are non-asymptotic,
geometric, and explicit. The bounds depend explicitly on
the graph sequence properties, as well as the agent learning
capabilities. Moreover, we do so in a new general setting where
there might not be a "true state of the world" which is perfectly
described by a single hypothesis, i.e. misspecified models.
Additionally, we analyze networks where agents might have
conflicting hypotheses, i.e. the hypothesis with the highest
confidence changes if different subsets of agents are taken into
account. The algorithm for fixed undirected graphs achieves a
factor of $n$ improvement in the convergence rate with respect
to the number of agents in comparison with that of the existing
algorithms.

Our work suggests a number of open questions. The
problem of tracking the optimal hypothesis when its distribution
changes with time requires further study [58]. Ideas from
social sampling can also be incorporated in this framework
[59], where the dimension of the beliefs is large and only
partial beliefs are transmitted. Moreover, studying the influence of
corrupted measurements or malicious agents is also of interest,
especially in the setting of social networks.